Data Mining, Validation, and Collaborative Knowledge Capture

Author(s):  
Martin Atzmueller ◽  
Stephanie Beer ◽  
Frank Puppe

For large-scale data mining that draws on ubiquitous, mixed-structured data sources, extracting the data and integrating it into a comprehensive data warehouse is usually of prime importance. Appropriate methods for validation and potential refinement are then essential. This chapter describes an approach that integrates data mining, information extraction, and validation with collaborative knowledge management and capture in order to improve the data acquisition process. For collaboration, a semantic wiki-enabled system for knowledge and experience management is presented. The proposed approach applies information extraction techniques together with pattern mining methods for initial data validation and is applicable to heterogeneous sources, i.e., it can integrate structured and unstructured data. The methods are embedded in an incremental process that provides continuous validation options. The approach has been developed in a health informatics context: the results of a medical application demonstrate that pattern mining and the applied rule-based information extraction methods are well suited for discovering, extracting, and validating clinically relevant knowledge, and confirm the applicability of the knowledge capture approach. The chapter reports experiences from a case study in the medical domain of sonography.
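As a rough illustration of the rule-based extraction step described above, the sketch below maps free-text sonography findings to attribute-value pairs with regular-expression rules. The attribute names, patterns, and report text are invented for illustration and are not the authors' actual rule base.

```python
import re

# Hypothetical rule base: each attribute is extracted from free-text
# findings by a regular expression. Patterns and attribute names are
# illustrative only, not the rule base used in the chapter.
RULES = {
    "liver_size": re.compile(r"liver\s+(?:is\s+)?(enlarged|normal)", re.I),
    "gallstones": re.compile(r"(no\s+)?gallstones", re.I),
}

def extract(report: str) -> dict:
    """Apply every extraction rule to the report and collect matches."""
    record = {}
    for attribute, pattern in RULES.items():
        match = pattern.search(report)
        if match:
            record[attribute] = match.group(0).lower()
    return record

print(extract("Liver is enlarged. No gallstones detected."))
# → {'liver_size': 'liver is enlarged', 'gallstones': 'no gallstones'}
```

Each extracted record could then feed the pattern-mining validation step, which checks the structured attributes against expectations learned from the data warehouse.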

Data Mining ◽  
2013 ◽  
pp. 1189-1207


2018 ◽  
Vol 74 (5) ◽  
pp. 936-950
Author(s):  
Anne Chardonnens ◽  
Ettore Rizza ◽  
Mathias Coeckelbergs ◽  
Seth van Hooland

Purpose: Advanced use of web analytics tools makes it possible to capture the content of user queries. Despite their relevance, manually analysing large volumes of user queries is impractical. The purpose of this paper is to address the problem of named entity recognition in digital library user queries.

Design/methodology/approach: The paper presents a large-scale case study conducted at the Royal Library of Belgium on its online historical newspapers platform BelgicaPress. The object of the study is a data set of 83,854 queries resulting from 29,812 visits over a 12-month period. Using information extraction methods, knowledge bases (KBs) and various authority files, the paper examines the possibilities and limits of identifying what percentage of end users are looking for person and place names.

Findings: Based on a quantitative assessment, the method successfully identifies the majority of person and place names in user queries. Due to the specific character of user queries and the nature of the KBs used, a limited number of queries remained too ambiguous to be treated in an automated manner.

Originality/value: The paper demonstrates empirically how user queries can be extracted from a web analytics tool and how named entities can then be mapped to KBs and authority files, in order to facilitate automated analysis of their content. The methods and tools used are generalisable and can be reused by other collection holders.
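A minimal sketch of the authority-file matching idea: normalised query strings are looked up in small person and place gazetteers, and anything unmatched stays ambiguous. The names and queries below are invented; the study matches against large KBs and authority files rather than toy sets.

```python
# Toy gazetteers standing in for the authority files; entries invented.
PERSONS = {"leopold ii", "albert einstein"}
PLACES = {"brussels", "antwerp", "liege"}

def classify_query(query: str) -> str:
    """Label a user query as a person name, a place name, or ambiguous."""
    q = query.strip().lower()
    if q in PERSONS:
        return "person"
    if q in PLACES:
        return "place"
    return "ambiguous"

for q in ["Leopold II", "Brussels", "1914 mobilisation"]:
    print(q, "->", classify_query(q))
```

The residual "ambiguous" bucket mirrors the paper's finding that a fraction of queries cannot be resolved automatically against the KBs.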


Author(s):  
Junsong Yuan

One of the focal themes in data mining research is the discovery of frequent and repetitive patterns in data. The success of frequent pattern mining (Han, Cheng, Xin, & Yan, 2007) in structured data (e.g., transaction data) and semi-structured data (e.g., text) has recently aroused interest in applying it to multimedia data. Given a collection of unlabeled images, videos, or audio, the objective of repetitive pattern discovery is to find similar patterns (if any exist) that appear repetitively across the whole dataset. Discovering such repetitive patterns in multimedia data raises interesting new problems in data mining research. It also offers opportunities for solving traditional tasks in multimedia research, including visual similarity matching (Boiman & Irani, 2006), visual object retrieval (Sivic & Zisserman, 2004; Philbin, Chum, Isard, Sivic & Zisserman, 2007), categorization (Grauman & Darrell, 2006), recognition (Quack, Ferrari, Leibe & Gool, 2007; Amores, Sebe, & Radeva, 2007), as well as audio object search and indexing (Herley, 2006).

• In image mining, frequent or repetitive patterns can be similar image texture regions, a specific visual object, or a category of objects. These repetitive patterns appear in a sub-collection of the images (Hong & Huang, 2004; Tan & Ngo, 2005; Yuan & Wu, 2007; Yuan, Wu & Yang, 2007; Yuan, Li, Fu, Wu & Huang, 2007).
• In video mining, repetitive patterns can be repeated short video clips (e.g., commercials) or temporal visual events that happen frequently in the given videos (Wang, Liu & Yang, 2005; Xie, Kennedy, Chang, Divakaran, Sun, & Lin, 2004; Yang, Xue, & Tian, 2005; Yuan, Wang, Meng, Wu & Li, 2007).
• In audio mining, repetitive patterns can be repeated structures appearing in music (Lartillot, 2005) or broadcast audio (Herley, 2006).

Repetitive pattern discovery is a challenging problem because we have no a priori knowledge of the possible repetitive patterns. For example, it is generally unknown in advance (i) what the repetitive patterns look like (e.g., the shape and appearance of the repetitive object, or the contents of the repetitive clip); (ii) where they are (location) and how large they are (scale of the repetitive object or length of the repetitive clip); (iii) how many repetitive patterns there are in total and how many instances each has; or even (iv) whether such repetitive patterns exist at all. An exhaustive solution would need to search over all possible pattern sizes and locations, and is thus extremely computationally demanding, if not infeasible.
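The exhaustive search just described can be made concrete on a symbolic toy sequence: slide windows of every candidate length over the data and count exact repeats. Real multimedia mining must replace exact symbol matching with similarity matching and prune this search, which is precisely what makes the problem hard.

```python
from collections import Counter

# Brute-force repetitive pattern discovery on a symbolic sequence:
# enumerate every window length and position, then keep patterns that
# occur at least min_count times. Cost grows with both unknowns the
# text lists: pattern size (length loop) and location (window loop).
def repetitive_patterns(seq, min_len=2, min_count=2):
    found = {}
    for length in range(min_len, len(seq) // 2 + 1):
        windows = [tuple(seq[i:i + length]) for i in range(len(seq) - length + 1)]
        for pattern, count in Counter(windows).items():
            if count >= min_count:
                found[pattern] = count
    return found

print(repetitive_patterns("abcxabcyabc"))
```

On the toy string, the repeated trigram `abc` (and its sub-patterns) is recovered without any prior knowledge of its content, length, or positions.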


2019 ◽  
Author(s):  
Mohammad Atif Faiz Afzal ◽  
Mojtaba Haghighatlari ◽  
Sai Prasad Ganesh ◽  
Chong Cheng ◽  
Johannes Hachmann

<div>We present a high-throughput computational study to identify novel polyimides (PIs) with exceptional refractive index (RI) values for use as optic or optoelectronic materials. Our study utilizes an RI prediction protocol based on a combination of first-principles and data modeling developed in previous work, which we employ on a large-scale PI candidate library generated with the ChemLG code. We deploy the virtual screening software ChemHTPS to automate the assessment of this extensive pool of PI structures in order to determine the performance potential of each candidate. This rapid and efficient approach yields a number of highly promising lead compounds. Using the data mining and machine learning program package ChemML, we analyze the top candidates with respect to prevalent structural features and feature combinations that distinguish them from less promising ones. In particular, we explore the utility of various strategies that introduce highly polarizable moieties into the PI backbone to increase its RI. The derived insights provide a foundation for rational and targeted design that goes beyond traditional trial-and-error searches.</div>
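The screening loop itself can be sketched generically: score every candidate with a surrogate property predictor and keep the top-ranked leads. The candidate names and RI values below are invented; the study's actual predictor is the first-principles/data-model RI protocol, driven through the ChemHTPS and ChemML toolchain.

```python
# Generic virtual-screening sketch: rank candidates by a predicted
# property and keep the top leads. The predictor is a stand-in for the
# study's RI prediction protocol; candidate names/values are invented.
def screen(candidates, predict, top_k=3):
    """Return the top_k candidates, ranked by predicted property."""
    return sorted(candidates, key=predict, reverse=True)[:top_k]

# Toy "refractive index" lookup playing the role of the predictor.
toy_ri = {"PI-001": 1.71, "PI-002": 1.64, "PI-003": 1.79, "PI-004": 1.68}
leads = screen(toy_ri, toy_ri.get)
print(leads)  # → ['PI-003', 'PI-001', 'PI-004']
```

The follow-up analysis described above would then mine these leads for the structural features (e.g., highly polarizable moieties) that separate them from low-scoring candidates.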


2002 ◽  
Vol 8 (2-3) ◽  
pp. 209-233 ◽  
Author(s):  
OLIVIER FERRET ◽  
BRIGITTE GRAU

Topic analysis is important for many applications dealing with texts, such as text summarization or information extraction. However, it can be done with great precision only if it relies on structured knowledge, which is difficult to produce on a large scale. In this paper, we propose using bootstrapping to solve this problem: a first topic analysis based on a weakly structured source of knowledge, a collocation network, is used for learning explicit topic representations that then support a more precise and reliable topic analysis.
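A collocation network of the kind used here as the weakly structured knowledge source can be sketched as co-occurrence counts over word pairs; the corpus below is invented, and the paper's topic segmentation and bootstrapping steps are omitted.

```python
from collections import defaultdict
from itertools import combinations

# Sketch of a collocation network: edge weights are co-occurrence
# counts of word pairs within a context unit (here, one sentence).
# Corpus is invented; no preprocessing (stemming, filtering) is shown.
def collocation_network(sentences):
    weights = defaultdict(int)
    for sentence in sentences:
        for a, b in combinations(sorted(set(sentence.split())), 2):
            weights[(a, b)] += 1
    return weights

corpus = ["engine fuel piston", "engine piston valve", "market stock trade"]
net = collocation_network(corpus)
print(net[("engine", "piston")])  # → 2
```

In the bootstrapping scheme described above, a first topic analysis over such a network would produce the explicit topic representations that support the later, more precise analysis.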


Author(s):  
Krzysztof Jurczuk ◽  
Marcin Czajkowski ◽  
Marek Kretowski

This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers: it searches for the tree structure and the split tests simultaneously, which in many situations improves both the prediction and the size of the resulting classifiers. However, as a population-based, iterative approach, it can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines the knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of the GPUs' memory and computing resources. The search for the tree structure and tests is performed on a CPU, while the fitness calculations are delegated to the GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets, and in both cases the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. As the number of GPUs grows, nearly linear scalability is observed, which suggests that the data size boundaries for evolutionary DT mining are fading.
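The data-parallel decomposition can be illustrated without CUDA: partition the dataset into chunks (one per device in the multi-GPU design), compute a partial error count per chunk, and reduce the partials into a fitness value on the CPU. Below, a simple threshold classifier stands in for a decision tree, and a plain loop stands in for the GPU kernels; all values are toy data.

```python
# Data-parallel fitness sketch: each chunk plays the role of the data
# slice held by one GPU; partials are reduced on the "CPU" side.
def chunked(data, n_chunks):
    """Split data into up to n_chunks contiguous slices."""
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_errors(threshold, chunk):
    """Misclassification count for a toy classifier: x >= threshold -> True."""
    return sum(1 for x, label in chunk if (x >= threshold) != label)

def fitness(threshold, data, n_devices=4):
    errors = sum(partial_errors(threshold, c) for c in chunked(data, n_devices))
    return 1 - errors / len(data)  # accuracy used as fitness

data = [(0.2, False), (0.4, False), (0.6, True), (0.9, True)]
print(fitness(0.5, data))  # → 1.0
```

Because each partial depends only on its own chunk, the per-chunk work can be dispatched to independent devices, which is what yields the near-linear scaling reported above.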

