Knowledge Discovery in Databases with Diversity of Data Types

Author(s):  
QingXiang Wu ◽  
Martin McGinnity ◽  
Girijesh Prasad ◽  
David Bell

Data mining and knowledge discovery aim at finding useful information in typically massive collections of data, and then extracting useful knowledge from that information. To date, a large number of approaches have been proposed for finding useful information and discovering useful knowledge, for example decision trees, Bayesian belief networks, evidence theory, rough set theory, fuzzy set theory, the k-NN (k-nearest-neighbor) classifier, neural networks, and support vector machines. However, each of these approaches is designed for a specific data type. In the real world, an intelligent system often encounters mixed data types, incomplete information (missing values), and imprecise information (fuzzy conditions). The UCI (University of California, Irvine) Machine Learning Repository contains many real-world data sets with missing values and mixed data types. Enabling machine learning or data mining approaches to deal with mixed data types is a challenge (Ching, 1995; Coppock, 2003) because it is difficult to define a measure of similarity between objects whose attributes have mixed data types. Handling mixed data types is a long-standing issue in data mining. The emerging techniques targeted at this issue can be grouped into three classes: (1) symbolic data mining approaches combined with discretizers (e.g., Dougherty et al., 1995; Wu, 1996; Kurgan et al., 2004; Diday, 2004; Darmont et al., 2006; Wu et al., 2007) that transform continuous data into symbolic data; (2) numerical data mining approaches combined with transformations from symbolic data to numerical data (e.g., Kasabov, 2003; Darmont et al., 2006; Hadzic et al., 2007); and (3) hybrids of symbolic and numerical data mining approaches (e.g., Tung, 2002; Kasabov, 2003; Leng et al., 2005; Wu et al., 2006). Since hybrid approaches have the potential to exploit the advantages of both symbolic and numerical data mining, this chapter, after discussing the merits and shortcomings of current approaches, focuses on applying a Self-Organizing Computing Network model to construct a hybrid system that solves the problems of knowledge discovery from databases with a diversity of data types. Future trends for data mining on mixed-type data are then discussed. Finally, a conclusion is presented.
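
A minimal sketch of the similarity problem the chapter refers to: a Gower-style coefficient is one standard way to compare objects whose attributes mix numeric and symbolic types while tolerating missing values. The function below is illustrative only (the attribute names, values, and ranges are invented), and it is not the chapter's Self-Organizing Computing Network model.

```python
def gower_similarity(a, b, ranges):
    """Similarity between two records with mixed attribute types.

    a, b   : dicts mapping attribute name -> value (None for missing)
    ranges : dict mapping each numeric attribute to its observed range
    """
    score, count = 0.0, 0
    for attr, x in a.items():
        y = b.get(attr)
        if x is None or y is None:           # skip missing values
            continue
        if attr in ranges:                   # numeric attribute
            r = ranges[attr]
            score += 1.0 - abs(x - y) / r if r > 0 else 1.0
        else:                                # symbolic attribute
            score += 1.0 if x == y else 0.0
        count += 1
    return score / count if count else 0.0

# Hypothetical records mixing numeric and symbolic attributes
p1 = {"age": 35, "blood_type": "A", "bmi": None}
p2 = {"age": 50, "blood_type": "A", "bmi": 24.0}
print(gower_similarity(p1, p2, ranges={"age": 60, "bmi": 25.0}))  # 0.875
```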

Author(s):  
Miroslav Hudec ◽  
Miljan Vučetić ◽  
Mirko Vujošević

Data mining methods based on fuzzy logic have been developed in recent years and have become an increasingly important research area. In this chapter, the authors examine possibilities for discovering potentially useful knowledge from relational databases by integrating fuzzy functional dependencies and linguistic summaries. Both methods use fuzzy logic tools for data analysis and for acquiring and representing expert knowledge. Fuzzy functional dependencies can detect whether a dependency between two examined attributes holds across the whole database. If a dependency holds only between parts of the examined attributes' domains, however, fuzzy functional dependencies cannot characterize it; linguistic summaries are a convenient method for revealing this kind of dependency. Using fuzzy functional dependencies and linguistic summaries in a complementary way can thus mine valuable information from relational databases. Mining the intensities of dependencies between database attributes can support decision making, reduce the number of attributes in databases, and help estimate missing values. The proposed approach is evaluated in case studies using real data from official statistics. Strengths and weaknesses of the described methods are discussed, and topics for further research are outlined at the end of the chapter.
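
For the linguistic-summary side, a minimal sketch of the classic Yager truth degree for a summary of the form "most records are S" may help fix ideas. The membership functions, the fuzzy quantifier, and the unemployment figures below are invented for illustration and are not taken from the chapter's case studies.

```python
def triangular(x, a, b, c):
    """Triangular membership function rising from a to peak b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def most(p):
    """Zadeh-style fuzzy quantifier 'most' over a proportion p in [0, 1]."""
    if p <= 0.3:
        return 0.0
    if p >= 0.8:
        return 1.0
    return (p - 0.3) / 0.5

def summary_truth(records, summarizer, quantifier=most):
    """Truth degree of 'Q records are S' (Yager's basic form)."""
    memberships = [summarizer(r) for r in records]
    return quantifier(sum(memberships) / len(memberships))

# Hypothetical summary: 'most municipalities have high unemployment'
unemployment = [4.2, 11.5, 13.0, 9.8, 15.2, 12.4, 3.1, 14.8]
high = lambda x: triangular(x, 8.0, 14.0, 20.0)
print(summary_truth(unemployment, high))
```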


Web Services ◽  
2019 ◽  
pp. 105-126
Author(s):  
N. Nawin Sona

This chapter aims to give an overview of the wide range of Big Data approaches and technologies available today. The data features of Volume, Velocity, and Variety are examined against new database technologies. The chapter explores the complexity of data types, methodologies of storage, access, and computation, current and emerging trends in data analysis, and methods of extracting value from data. It addresses the need for clarity regarding the future of RDBMS and the newer systems, and highlights how actionable insights can be built into public-sector domains using methods such as machine learning, data mining, and predictive analytics.


Data Mining ◽  
2013 ◽  
pp. 816-836
Author(s):  
Farid Bourennani ◽  
Shahryar Rahnamayan

Nowadays, many universities, research centers, and companies worldwide share their data electronically. Naturally, these data are of heterogeneous types such as text, numerical data, multimedia, and others. From the user's perspective, these data should be accessible in a uniform manner, which implies a unified approach to representing and processing them. Furthermore, unified processing of heterogeneous data types can lead to richer semantic results. In this chapter, we present a unified pre-processing approach that leads to the generation of richer semantics from qualitative and quantitative data.
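
One common way to realize such a unified representation, sketched here under the assumption that each record mixes a free-text field with numeric measurements, is to map both into a single vector space. The records below are hypothetical, and the chapter's actual pre-processing approach may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical records: a textual description plus (price, accuracy) numbers
texts = ["reliable low-cost sensor", "high precision sensor", "low-cost actuator"]
numbers = np.array([[120.0, 0.8], [450.0, 0.99], [95.0, 0.7]])

tfidf = TfidfVectorizer().fit_transform(texts).toarray()   # qualitative -> vectors
scaled = MinMaxScaler().fit_transform(numbers)             # quantitative -> [0, 1]

# One vector space for both data types; any standard miner can now process it
unified = np.hstack([tfidf, scaled])
print(unified.shape)
```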


Author(s):  
Andi Baritchi

In today’s business world, the use of computers for everyday business processes and data recording has become virtually ubiquitous. With the advent of this electronic age comes one priceless by-product: data. As more and more executives are discovering each day, companies can harness data to gain valuable insights into their customer base. Data mining is the process used to take these immense streams of data and reduce them to useful knowledge. It has countless applications, including sales and marketing, customer support, knowledge-base development, and fraud detection in virtually any field. The term “data mining,” a bit of a misnomer, refers to mining the data to find the gems hidden inside it, and it is the name most often used for this process. It is important to note, however, that data mining is only one part of the Knowledge Discovery in Databases process, albeit the workhorse. In this chapter, we provide a concise description of the knowledge discovery process, from domain analysis and data selection, to data preprocessing and transformation, to the data mining itself, and finally the interpretation and evaluation of the results as applied to the domain. We describe the different flavors of data mining, including association rules, classification and prediction, clustering and outlier analysis, and customer profiling, and explain how each of these can be used in practice to improve a business’s understanding of its customers. We introduce the reader to some of today’s popular data mining resources, and, for those who are interested, we close the chapter with a concise technical overview of how each data mining technology works.
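
As a taste of the association rules flavor mentioned above, the brute-force sketch below computes support and confidence over a toy set of market-basket transactions. Real miners such as Apriori prune the candidate itemsets instead of enumerating them, and the transactions and thresholds here are invented.

```python
from itertools import permutations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Brute-force single-item rules a -> b; Apriori would prune candidates instead
items = sorted(set().union(*transactions))
for a, b in permutations(items, 2):
    s = support({a, b})
    if s >= 0.4:                                  # minimum support threshold
        confidence = s / support({a})
        print(f"{a} -> {b}: support={s:.2f}, confidence={confidence:.2f}")
```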


Author(s):  
Edgard Benítez-Guerrero ◽  
Omar Nieva-García

The vast amounts of digital information stored in databases and other repositories represent a challenge for finding useful knowledge. Traditional methods for turning data into knowledge based on manual analysis reach their limits in this context, and for this reason, computer-based methods are needed. Knowledge Discovery in Databases (KDD) is the semi-automatic, nontrivial process of identifying valid, novel, potentially useful, and understandable knowledge (in the form of patterns) in data (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996). KDD is an iterative and interactive process with several steps: understanding the problem domain, data preprocessing, pattern discovery, and pattern evaluation and usage. For discovering patterns, Data Mining (DM) techniques are applied.


2019 ◽  
Vol 11 (2) ◽  
pp. 1-22
Author(s):  
Alina Lazar ◽  
Ling Jin ◽  
C. Anna Spurlock ◽  
Kesheng Wu ◽  
Alex Sim ◽  
...  

Author(s):  
Zhongguang Fu ◽  
Tao Jin ◽  
Kun Yang

Rough set theory is a powerful tool for dealing with vagueness and uncertainty. It is particularly suitable for discovering hidden and potentially useful knowledge in data, and it can be used to reduce features and extract rules. This paper introduces the basic concepts and fundamental elements of rough set theory. A reduction algorithm that integrates a priori knowledge with attribute significance is proposed to illustrate how rough set theory can be used to extract fault features of the condenser in a power plant. Two testing examples are then presented to demonstrate the effectiveness of the theory in fault diagnosis.
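
The core rough set constructs are easy to state in code. Below is a minimal sketch of indiscernibility classes and lower/upper approximations over a toy table of invented condenser symptoms; it illustrates the basic concepts only, not the paper's reduction algorithm.

```python
from collections import defaultdict

def partition(objects, attrs):
    """Group objects into indiscernibility classes w.r.t. the given attributes."""
    blocks = defaultdict(list)
    for name, row in objects.items():
        blocks[tuple(row[a] for a in attrs)].append(name)
    return list(blocks.values())

def approximations(objects, attrs, target):
    """Lower and upper approximations of a target set of object names."""
    lower, upper = set(), set()
    for block in partition(objects, attrs):
        if set(block) <= target:      # block certainly inside the target
            lower |= set(block)
        if set(block) & target:       # block possibly inside the target
            upper |= set(block)
    return lower, upper

# Invented condenser records: symptoms -> fault label
data = {
    "u1": {"vacuum": "low",  "temp": "high", "fault": "yes"},
    "u2": {"vacuum": "low",  "temp": "high", "fault": "no"},
    "u3": {"vacuum": "high", "temp": "low",  "fault": "no"},
    "u4": {"vacuum": "low",  "temp": "low",  "fault": "yes"},
}
faulty = {u for u, r in data.items() if r["fault"] == "yes"}
print(approximations(data, ["vacuum", "temp"], faulty))  # ({'u4'}, {'u1','u2','u4'})
```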


2018 ◽  
Vol 2018 ◽  
pp. 1-9 ◽  
Author(s):  
Min-Wei Huang ◽  
Wei-Chao Lin ◽  
Chih-Fong Tsai

Many real-world medical datasets contain some proportion of missing (attribute) values. In general, this problem can be solved by missing value imputation, which provides estimations of the missing values through a reasoning process based on the (complete) observed data. However, if the observed data contain noisy information or outliers, the estimations of the missing values may not be reliable, or may even be quite different from the real values. The aim of this paper is to examine whether combining instance selection on the observed data with missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms (DROP3, GA, and IB3) and three imputation algorithms (KNNI, MLP, and SVM) are combined in order to find the best pairing. The experimental results show that instance selection can have a positive impact on missing value imputation for medical datasets with numerical attributes, and that specific combinations of instance selection and imputation methods can improve the imputation results for medical datasets with mixed data types. However, instance selection does not have a consistently positive impact on the imputation results for categorical medical datasets.
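
A rough sketch of the two-step idea (instance selection on the observed data, then imputation) using scikit-learn: LocalOutlierFactor stands in here for the instance selection step, since the paper's DROP3, GA, and IB3 algorithms have no standard library implementations, and KNNImputer plays the role of KNNI. The data are synthetic.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(50, 5, size=(100, 3))           # synthetic 'medical' records
X[5, 1] = 500.0                                # inject one gross outlier
mask = rng.random(X.shape) < 0.1               # ~10% of cells go missing
X_missing = X.copy()
X_missing[mask] = np.nan

# Step 1: instance selection over the complete rows (LOF as a simple
# stand-in for the paper's DROP3 / GA / IB3 algorithms)
complete = X_missing[~np.isnan(X_missing).any(axis=1)]
keep = LocalOutlierFactor(n_neighbors=10).fit_predict(complete) == 1
selected = complete[keep]

# Step 2: impute missing values using only the selected (cleaner) instances
imputer = KNNImputer(n_neighbors=5).fit(selected)
X_imputed = imputer.transform(X_missing)
print(np.isnan(X_imputed).sum())               # 0 -> all values filled
```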


Author(s):  
Gabriele Kern-Isberner

Knowledge discovery refers to the process of extracting new, interesting, and useful knowledge from data and presenting it in an intelligible way to the user. Roughly, knowledge discovery can be considered a three-step process: preprocessing the data; data mining, in which the actual exploratory work is done; and interpreting the results for the user. Here, I focus on the data mining step, assuming that a suitable set of data has been chosen properly. The patterns that we search for in the data are plausible relationships, which agents may use to establish cognitive links for reasoning. Such plausible relationships can be expressed via association rules. Usually, the criteria used to judge the relevance of such rules are either frequency based (Bayardo & Agrawal, 1999) or causality based (for Bayesian networks, see Spirtes, Glymour, & Scheines, 1993). Here, I pursue a different approach that aims at extracting what can be regarded as structures of knowledge: relationships that may support the inductive reasoning of agents and whose relevance is founded on information theory. The method sketched in this article takes numerical relationships found in data and interprets them as structural ones, using mostly algebraic techniques to elaborate structural information.
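
As a baseline illustration of information-theoretically founded relevance, the sketch below scores how much conditioning on an antecedent A reduces uncertainty about a consequent B, i.e., the mutual information I(A;B). The records are invented, and the article's actual method rests on algebraic conditional structures that this toy computation does not reproduce.

```python
import math
from collections import Counter

# Invented records over two binary attributes; each row is (a, b)
rows = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

b_all = [b for _, b in rows]
b_given_a = [b for a, b in rows if a == 1]
b_given_not_a = [b for a, b in rows if a == 0]

# Information gained about B by conditioning on A: I(A;B) = H(B) - H(B|A)
p_a = len(b_given_a) / len(rows)
h_cond = p_a * entropy(b_given_a) + (1 - p_a) * entropy(b_given_not_a)
print(f"H(B)={entropy(b_all):.3f}  H(B|A)={h_cond:.3f}  "
      f"I(A;B)={entropy(b_all) - h_cond:.3f}")
```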

