Knowledge Discovery and Data Mining
Latest Publications

Total documents: 13 (five years: 0)
H-index: 3 (five years: 0)
Published by: IGI Global
ISBN: 9781599042527, 9781599042541

Author(s):  
Elena Irina Neaga

This chapter presents a roadmap for the bidirectional interaction and support between knowledge discovery (KD) processes and ontology engineering (Onto), aimed mainly at providing refined models through common methodologies. The approach offers the holistic literature review required for the further definition of a comprehensive framework and an associated meta-methodology (KD4Onto4DM), grounded in existing theories, paradigms, and practices in knowledge discovery and ontology engineering, as well as in closely related areas such as knowledge engineering, machine and ontology learning, standardization issues, and architectural models. The suggested framework may adhere to the ISO Reference Model for Open Distributed Processing and the OMG Model-Driven Architecture, and associated dedicated software architectures should be defined.


Author(s):
Jose Ma. J. Alvir, Javier Cabrera

Mining clinical trials is becoming an important tool for extracting information that might help design better clinical trials. One important objective is to identify characteristics of a subset of cases that responds substantially differently from the rest. For example, what are the characteristics of placebo responders? Who has the best or worst response to a particular treatment? Are there subsets among the treated group who perform particularly well? In this chapter we give an overview of the process of conducting clinical trials and the places where data mining might be of interest. We also introduce an algorithm for constructing data mining trees that are very useful for answering the above questions by detecting interesting features of the data. We illustrate the ARF method with an analysis of data from four placebo-controlled trials of ziprasidone in schizophrenia.
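The chapter's tree-construction algorithm is not specified here, but the core idea of hunting for a subgroup whose response differs from the rest can be sketched as a single greedy split search. All field and function names below are invented; the real method is recursive and statistically guarded, which this sketch omits.

```python
# Hypothetical sketch: find the single feature/value split whose subset's
# mean response differs most from the rest of the sample.
def best_split(records, response_key, feature_keys):
    best = None  # (absolute difference in means, feature, value)
    for f in feature_keys:
        for v in {r[f] for r in records}:
            inside = [r[response_key] for r in records if r[f] == v]
            outside = [r[response_key] for r in records if r[f] != v]
            if not inside or not outside:
                continue
            diff = abs(sum(inside) / len(inside) - sum(outside) / len(outside))
            if best is None or diff > best[0]:
                best = (diff, f, v)
    return best
```

Applied recursively to the winning subset, a search of this kind grows the sort of tree the chapter describes.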


Author(s):  
Malcolm J. Beynon

The efficacy of data mining lies in its ability to identify relationships amongst data. This chapter argues that this efficacy is constrained by the quality of the data analysed, including whether the data is imprecise or, in the worst case, incomplete. Through a description of Dempster-Shafer theory (DST), a general methodology based on uncertain reasoning, it argues that traditional data mining techniques are not structured to handle such imperfect data and instead require the external management of missing values, and so forth. One DST-based technique is classification and ranking belief simplex (CaRBS), which allows intelligent data mining through the acceptance of missing values in the data analysed, treating them as a factor of ignorance rather than requiring their external management. Results presented here, using CaRBS and a number of simplex plots, show the effect of managing and not managing imperfect data.
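As background for the DST machinery underlying CaRBS, Dempster's rule of combination, which fuses two independent bodies of evidence, can be sketched as follows. This is a generic textbook illustration, not the CaRBS algorithm itself.

```python
# Dempster's rule of combination: two mass functions (dicts mapping
# frozensets of hypotheses to belief mass) are fused; mass assigned to
# empty intersections is conflict and is renormalised away.
def combine(m1, m2):
    raw = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                raw[inter] = raw.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    return {s: m / (1.0 - conflict) for s, m in raw.items()}
```

Mass on a non-singleton set such as {x, y} expresses ignorance between x and y, which is how DST-based methods can absorb missing values without external imputation.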


Author(s):  
Jason H. Moore

Human genetics is an evolving discipline that is being driven by rapid advances in technologies that make it possible to measure enormous quantities of genetic information. An important goal of human genetics is to understand the mapping relationship between interindividual variation in DNA sequences (i.e., the genome) and variability in disease susceptibility (i.e., the phenotype). The focus of the present study is the detection and characterization of nonlinear interactions among DNA sequence variations in human populations using data mining and machine learning methods. We first review the concept difficulty and then review a multifactor dimensionality reduction (MDR) approach that was developed specifically for this domain. We then present some ideas about how to scale the MDR approach to datasets with thousands of attributes (i.e., genome-wide analysis). Finally, we end with some ideas about how nonlinear genetic models might be statistically interpreted to facilitate making biological inferences.
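A much-simplified sketch of the MDR idea, for a single pair of attributes: pool each two-locus genotype cell into "high risk" when its case/control ratio exceeds the overall ratio, then score the resulting two-class pooling. This illustrates the principle only; it assumes both cases and controls are present, and the real method adds cross-validation and an exhaustive search over attribute combinations.

```python
from collections import defaultdict

# Simplified MDR step for one attribute pair (i, j).
# genotypes: list of tuples of genotype codes; labels: 1 = case, 0 = control.
def mdr_score(genotypes, labels, i, j):
    cases = defaultdict(int)
    ctrls = defaultdict(int)
    for g, y in zip(genotypes, labels):
        cell = (g[i], g[j])
        if y:
            cases[cell] += 1
        else:
            ctrls[cell] += 1
    overall = sum(labels) / (len(labels) - sum(labels))  # case:control ratio
    high = {c for c in set(cases) | set(ctrls)
            if ctrls[c] == 0 or cases[c] / ctrls[c] > overall}
    correct = sum(1 for g, y in zip(genotypes, labels)
                  if ((g[i], g[j]) in high) == bool(y))
    return correct / len(labels)
```

On a purely epistatic (XOR-like) two-locus model, neither locus alone separates cases from controls, but the pooled two-locus cells do, which is exactly the nonlinearity MDR targets.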


Author(s):
Naeem Seliya, Taghi M. Khoshgoftaar

In machine learning, the limited availability of labeled data for supervised learning is a challenging practical problem. We address a similar problem in the context of software quality modeling. Knowledge-based software engineering includes the use of quantitative software quality estimation models. Such models are trained using a priori software quality knowledge in the form of software metrics and defect data from previously developed software projects. However, various practical issues limit the availability of defect data for all modules in the training data. We present two solutions to the problem of software quality modeling when only a limited number of training modules have known defect data: a semisupervised clustering scheme with expert input, and a semisupervised classification approach based on the expectation-maximization algorithm. Software measurement datasets obtained from multiple NASA software projects are used in our empirical investigation. The software quality knowledge learnt during the semisupervised learning processes generalized well to multiple test datasets. In addition, both solutions provided better predictions than a supervised learner trained on the initial labeled dataset.
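The chapter's EM formulation is not reproduced here, but the labelled-plus-unlabelled loop behind semisupervised classification can be sketched with a minimal self-training scheme around a nearest-centroid learner. All names and the learner itself are illustrative substitutes.

```python
# Minimal self-training sketch: fit class centroids on the labelled points,
# assign the unlabelled points, then refit with those assignments included.
def centroid(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def self_train(labelled, unlabelled, rounds=3):
    """labelled: {point_tuple: class}; unlabelled: list of point tuples."""
    labels = dict(labelled)
    for _ in range(rounds):
        cents = {c: centroid([p for p, pc in labels.items() if pc == c])
                 for c in set(labels.values())}
        for p in unlabelled:
            labels[p] = min(cents, key=lambda c: dist2(p, cents[c]))
    return labels
```

After the first round the unlabelled points also shape the centroids, which is the sense in which unlabelled modules contribute to the fitted model.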


Author(s):
Fedja Hadzic, Tharam S. Dillon

Real-world datasets are often accompanied by various types of anomalous or exceptional entries, often referred to as outliers. Detecting outliers and distinguishing noise from true exceptions is important for effective data mining. This chapter presents two methods for outlier detection and analysis using the self-organizing map (SOM), one more suitable for categorical and the other for continuous data. Both are based on filtering out the instances that are not captured by, or are contradictory to, the concept hierarchy obtained for the domain. We demonstrate how the dimension of the output space plays an important role in the kind of patterns that will be detected as outlying. Furthermore, the concept hierarchy itself provides extra criteria for distinguishing noise from true exceptions. The effectiveness of the proposed outlier detection and analysis strategy is demonstrated through experiments on publicly available real-world datasets.
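A toy illustration of the general SOM-based filtering idea (not the chapter's methods, which also exploit the learned concept hierarchy): train a small one-dimensional SOM, then flag instances whose quantization error, their distance to the best-matching unit, is far above average. All parameters are arbitrary.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Tiny 1-D SOM: units on a line, neighbourhood radius 1, decaying learning rate.
def train_som(data, units=4, epochs=50, lr=0.5, seed=0):
    rng = random.Random(seed)
    weights = [list(rng.choice(data)) for _ in range(units)]
    for e in range(epochs):
        rate = lr * (1 - e / epochs)
        for x in data:
            bmu = min(range(units), key=lambda u: dist2(weights[u], x))
            for u in (bmu - 1, bmu, bmu + 1):  # move BMU and its neighbours
                if 0 <= u < units:
                    for k in range(len(x)):
                        weights[u][k] += rate * (x[k] - weights[u][k])
    return weights

# Flag instances whose quantization error greatly exceeds the average.
def outliers(data, weights, factor=3.0):
    qe = [min(dist2(w, x) for w in weights) for x in data]
    cutoff = factor * sum(qe) / len(qe)
    return [x for x, e in zip(data, qe) if e > cutoff]
```

In the chapter's terms, an instance poorly represented by any map unit is a candidate outlier; whether it is noise or a true exception then needs the extra criteria discussed above.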


Author(s):  
Petra Perner

This chapter introduces image mining as a method of discovering implicit, previously unknown, and potentially useful information from digital image and video repositories. It argues that image mining is a discipline in its own right because of the special type of data involved; therefore, image-mining methods that take into account this data representation and the different aspects of image mining have to be developed. Furthermore, a bridge has to be established between image mining and image processing, feature extraction, and image understanding, since the latter topics are concerned with the development of methods for the automatic extraction of higher-level image representations. We introduce our methodology, the developed methods, and the system for image mining, which we have successfully applied to several medical image-diagnostic tasks.


Author(s):
Amandeep S. Sidhu, Paul J. Kennedy, Simeon Simoff

In some real-world areas it is important to enrich the data with external background knowledge, so as to provide context and to facilitate pattern recognition. Such areas may be described as data rich but knowledge poor. There are two challenges in incorporating this background knowledge into the data mining cycle: (1) generating the ontologies, and (2) adapting the data mining algorithms to make use of them. This chapter presents the state of the art in bringing background ontology knowledge into the pattern recognition task for biomedical data.
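A tiny, invented illustration of challenge (2): enriching raw identifiers with ontology context before mining, so that a pattern can be matched at any level of abstraction. The hierarchy and terms below are made up for the example.

```python
# Toy is-a hierarchy: each term maps to its parent (terms are invented).
PARENT = {
    'kinase': 'enzyme',
    'protease': 'enzyme',
    'enzyme': 'protein',
}

# Expand a term with all of its ancestors, so two records annotated with
# 'kinase' and 'protease' can still co-occur at the 'enzyme' level.
def enrich(term):
    out = [term]
    while out[-1] in PARENT:
        out.append(PARENT[out[-1]])
    return out
```

Running a mining algorithm over the enriched annotations rather than the raw identifiers is one simple way an ontology enters the pattern recognition step.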


Author(s):  
Anna Olecka

This chapter focuses on the challenges of modeling credit risk for the new-accounts acquisition process in the credit card industry. The first section provides an overview and a brief history of credit scoring. The second section looks at some of the challenges specific to the credit industry. In many of these applications, the business objective is tied only indirectly to the classification scheme. Opposing objectives, such as response, profit, and risk, often play a tug of war with each other. Solving a business problem of such complexity often requires multiple models working jointly. The challenges for data mining lie in exploring solutions that go beyond traditional, well-documented methodology, and in the need for simplifying assumptions, often necessitated by the reality of dataset sizes and/or implementation issues. Examples of such challenges form an illustrative case of the compromise between data mining theory and applications.


Author(s):
Jia-Yu Pan, Hyung-Jeong Yang, Christos Faloutsos

Multimedia objects like video clips or captioned images contain data of various modalities such as image, audio, and transcript text. Correlations across different modalities provide information about the multimedia content, and are useful in applications ranging from summarization to semantic captioning. We propose a graph-based method, MAGIC, which represents multimedia data as a graph and can find cross-modal correlations using “random walks with restarts.” MAGIC has several desirable properties: (a) it is general and domain-independent; (b) it can detect correlations across any two modalities; (c) it is insensitive to parameter settings; (d) it scales up well for large datasets; (e) it enables novel multimedia applications (e.g., group captioning); and (f) it creates the opportunity to apply graph algorithms to multimedia problems. When applied to automatic image captioning, MAGIC finds correlations between text and image and achieves a relative improvement of 58% in captioning accuracy as compared to recent machine learning techniques.
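The "random walks with restarts" primitive that MAGIC is described as using can be sketched on a toy cross-modal graph. The graph and node names below are invented; MAGIC builds its graph from the multimedia objects and their attributes.

```python
# Random walk with restart: from the query node, repeatedly spread score
# along edges while returning to the query with probability `restart`.
# The steady-state scores rank all nodes by correlation with the query.
def rwr(graph, start, restart=0.15, iters=100):
    """graph: {node: [neighbours]}; returns node -> steady-state score."""
    nodes = list(graph)
    score = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: (restart if n == start else 0.0) for n in nodes}
        for n in nodes:
            nbrs = graph[n]
            if not nbrs:
                nxt[start] += (1 - restart) * score[n]  # dangling mass restarts
                continue
            share = (1 - restart) * score[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        score = nxt
    return score
```

For captioning, querying from an uncaptioned image node and reading off the highest-scoring text nodes yields the cross-modal correlations the chapter describes.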

