Mining Feature Relationships in Data

10.26686/wgtn.14456337 ◽

2021 ◽

Author(s):

Andrew Lensen

Keyword(s):

Association Rules ◽

Real World ◽

Programming Approach ◽

Rule Mining ◽

Real World Data ◽

Alternative Approach ◽

Exploratory Data ◽

Real World Datasets ◽

Symbolic Approach ◽

Insight Into

When faced with a new dataset, most practitioners begin by performing exploratory data analysis to discover interesting patterns and characteristics within data. Techniques such as association rule mining are commonly applied to uncover relationships between features (attributes) of the data. However, association rules are primarily designed for use on binary or categorical data, due to their use of rule-based machine learning. A large proportion of real-world data is continuous in nature, and discretisation of such data leads to inaccurate and less informative association rules. In this paper, we propose an alternative approach called feature relationship mining (FRM), which uses a genetic programming approach to automatically discover symbolic relationships between continuous or categorical features in data. To the best of our knowledge, our proposed approach is the first such symbolic approach with the goal of explicitly discovering relationships between features. Empirical testing on a variety of real-world datasets shows the proposed method is able to find high-quality, simple feature relationships which can be easily interpreted and which provide clear and non-trivial insight into data.


Causal Datasheet for Datasets: An Evaluation Guide for Real-World Data Analysis and Data Collection Design Using Bayesian Networks

Frontiers in Artificial Intelligence ◽

10.3389/frai.2021.612551 ◽

2021 ◽

Vol 4 ◽

Author(s):

Bradley Butcher ◽

Vincent S. Huang ◽

Christopher Robinson ◽

Jeremy Reffin ◽

Sema K. Sgaier ◽

...

Keyword(s):

Global Health ◽

Bayesian Networks ◽

Sample Size ◽

Observational Data ◽

Real World ◽

Structure Learning ◽

Ground Truth ◽

Research Process ◽

Real World Data ◽

Real World Datasets

Developing data-driven solutions that address real-world problems requires an understanding of these problems’ causes and how their interaction affects the outcome, often with only observational data. Causal Bayesian Networks (BN) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate the idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made dependent on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process.
First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.


Air Pollution Monitoring in Intelligent Cities Using Weighted Association Rule Mining

Developing and Monitoring Smart Environments for Intelligent Cities - Advances in Civil and Industrial Engineering ◽

10.4018/978-1-7998-5062-5.ch007 ◽

2021 ◽

pp. 171-197

Author(s):

Goksu Tuysuzoglu ◽

Derya Birant

Keyword(s):

Air Pollution ◽

Air Quality ◽

Association Rules ◽

Association Rule ◽

Association Rule Mining ◽

Experimental Studies ◽

Rule Mining ◽

Air Pollution Monitoring ◽

Real World Datasets ◽

Monitoring Stations

Through the use of internet of things-based sensors in air quality monitoring stations, the concentration of different pollutants and meteorological parameters can be regularly measured. In case of unusual conditions (e.g., increased levels of dangerous pollutants), a smart assessment system can produce a warning so that an appropriate air quality management process can be initiated. In this context, the objective of this study is to discover relationships and patterns among air pollution features and characteristics. Frequently observed association rules can then trigger an appropriate background smart environment system when a critical situation is detected. In the experimental studies in the current project, traditional association rule mining and weighted association rule mining methods have been employed using real-world datasets collected from 21 monitoring stations in Turkey. As a result, useful association rules exceeding the user-defined support and confidence levels were obtained that can form a basis for further research.
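The support and confidence thresholds mentioned in this abstract can be illustrated with a minimal Python sketch. It covers only the traditional (unweighted) measures, not the weighted variant used in the chapter, and the `readings` data, item labels, and function names are hypothetical and chosen purely for illustration.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimate P(consequent | antecedent) as a ratio of supports."""
    ante = frozenset(antecedent)
    ante_support = support(transactions, ante)
    if ante_support == 0.0:
        return 0.0
    both = ante | frozenset(consequent)
    return support(transactions, both) / ante_support


# Hypothetical discretised sensor readings, one set of items per time slot.
readings = [
    {"PM10=high", "wind=low"},
    {"PM10=high", "wind=low", "SO2=high"},
    {"PM10=low", "wind=high"},
    {"PM10=high", "wind=high"},
]

rule_support = support(readings, {"wind=low", "PM10=high"})
rule_confidence = confidence(readings, {"wind=low"}, {"PM10=high"})
```

A rule such as `wind=low -> PM10=high` would be kept only if both values exceed the user-defined minimum support and minimum confidence.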


FRI0200 Insight into the quality of life of patients with ankylosing spondylitis: real-world data from a US-based life impact survey

10.1136/annrheumdis-2018-eular.2650 ◽

2018 ◽

Author(s):

J.T. Rosenbaum ◽

L. Pisenti ◽

Y. Park ◽

R. Howard

Keyword(s):

Quality Of Life ◽

Ankylosing Spondylitis ◽

Real World ◽

Real World Data ◽

World Data ◽

Life Impact ◽

Insight Into


Online Multitask Relative Similarity Learning

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/253 ◽

2017 ◽

Cited By ~ 2

Author(s):

Shuji Hao ◽

Peilin Zhao ◽

Yong Liu ◽

Steven C. H. Hoi ◽

Chunyan Miao

Keyword(s):

Real World ◽

Learning Algorithm ◽

Learning Problems ◽

Similarity Function ◽

Learning Approaches ◽

Similarity Learning ◽

Real World Data ◽

Real World Datasets ◽

Online Learning Algorithm ◽

Relative Similarity

Relative similarity learning (RSL) aims to learn similarity functions from data with relative constraints. Most previous algorithms developed for RSL are batch-based learning approaches, which scale poorly when dealing with real-world data arriving sequentially. These methods are also often designed to learn a single similarity function for a specific task, so they may be sub-optimal for multi-task learning problems. To overcome these limitations, we propose a scalable RSL framework named OMTRSL (Online Multi-Task Relative Similarity Learning). Specifically, we first develop a simple yet effective online learning algorithm for multi-task relative similarity learning. We then propose an active learning algorithm to reduce the labeling cost. The proposed algorithms not only enjoy theoretical guarantees, but also show high efficacy and efficiency in extensive experiments on real-world datasets.
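To make the idea of learning from relative constraints concrete, the following Python sketch shows a single-task, passive-aggressive online update on one triplet (query, more-similar item, less-similar item) for a bilinear similarity function. This is a generic illustration of online relative similarity learning and is not the OMTRSL algorithm from the paper. The function names, the margin of 1, and the aggressiveness parameter `C` are assumptions.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def bilinear(M: Matrix, x: Vector, y: Vector) -> float:
    """Bilinear similarity S(x, y) = x^T M y."""
    return sum(x[i] * M[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def pa_update(M: Matrix, x: Vector, pos: Vector, neg: Vector,
              C: float = 1.0) -> Tuple[Matrix, float]:
    """One passive-aggressive step on the triplet (x, pos, neg).

    The constraint is S(x, pos) >= S(x, neg) + 1. Returns the updated
    matrix together with the hinge loss measured before the update.
    """
    loss = max(0.0, 1.0 - bilinear(M, x, pos) + bilinear(M, x, neg))
    if loss == 0.0:
        return M, loss  # constraint already satisfied: stay passive
    diff = [p - n for p, n in zip(pos, neg)]
    # Squared Frobenius norm of the outer product x diff^T.
    norm_sq = sum(v * v for v in x) * sum(d * d for d in diff)
    if norm_sq == 0.0:
        return M, loss
    tau = min(C, loss / norm_sq)  # step size capped by C
    updated = [[M[i][j] + tau * x[i] * diff[j] for j in range(len(diff))]
               for i in range(len(x))]
    return updated, loss


# Hypothetical triplet: the identity matrix initially ranks `neg` above `pos`.
M0 = [[1.0, 0.0], [0.0, 1.0]]
query, pos, neg = [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]
M1, loss_before = pa_update(M0, query, pos, neg)
```

Because each triplet is processed once and then discarded, an update of this kind suits data that arrive sequentially, which is the setting the abstract contrasts with batch-based methods.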


A framework for validating AI in precision medicine: considerations from the European ITFoC consortium

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01634-3 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Rosy Tsopra ◽

Xose Fernandez ◽

Claudio Luchinat ◽

Lilia Alberghina ◽

Hans Lehrach ◽

...

Keyword(s):

Treatment Response ◽

Real World ◽

Clinical Decision Making ◽

Precision Oncology ◽

Clinical Validation ◽

Learning Approaches ◽

Real World Data ◽

Privacy And Security ◽

The Future ◽

Real World Datasets

Background: Artificial intelligence (AI) has the potential to transform our healthcare systems significantly. New AI technologies based on machine learning approaches should play a key role in clinical decision-making in the future. However, their implementation in health care settings remains limited, mostly due to a lack of robust validation procedures. There is a need to develop reliable assessment frameworks for the clinical validation of AI. We present here an approach for assessing AI for predicting treatment response in triple-negative breast cancer (TNBC), using real-world data and molecular -omics data from clinical data warehouses and biobanks. Methods: The European “ITFoC (Information Technology for the Future Of Cancer)” consortium designed a framework for the clinical validation of AI technologies for predicting treatment response in oncology. Results: This framework is based on seven key steps specifying: (1) the intended use of AI, (2) the target population, (3) the timing of AI evaluation, (4) the datasets used for evaluation, (5) the procedures used for ensuring data safety (including data quality, privacy and security), (6) the metrics used for measuring performance, and (7) the procedures used to ensure that the AI is explainable. This framework forms the basis of a validation platform that we are building for the “ITFoC Challenge”. This community-wide competition will make it possible to assess and compare AI algorithms for predicting the response to TNBC treatments with external real-world datasets. Conclusions: The predictive performance and safety of AI technologies must be assessed in a robust, unbiased and transparent manner before their implementation in healthcare settings. We believe that the consideration of the ITFoC consortium will contribute to the safe transfer and implementation of AI in clinical settings, in the context of precision oncology and personalized care.


Learning Different Concept Hierarchies and the Relations between them from Classified Data

Data Mining ◽

10.4018/978-1-4666-2455-9.ch007 ◽

2013 ◽

pp. 125-141

Author(s):

Fernando Benites ◽

Elena Sapozhnikova

Keyword(s):

Data Mining ◽

Real World ◽

Rule Mining ◽

Mining System ◽

Interestingness Measures ◽

Class Hierarchies ◽

Data Mining System ◽

Real World Datasets ◽

Class Labels ◽

Concept Hierarchies

Methods for the automatic extraction of taxonomies and concept hierarchies from data have recently emerged as essential assistance for humans in ontology construction. The objective of this chapter is to show how the extraction of concept hierarchies and finding relations between them can be effectively coupled with a multi-label classification task. The authors introduce a data mining system which performs classification and addresses both issues by means of association rule mining. The proposed system has been tested on two real-world datasets with the class labels of each dataset coming from two different class hierarchies. Several experiments on hierarchy extraction and concept relation were conducted to evaluate the system, and three different interestingness measures were applied to select the most important relations between concepts. One of the measures was developed by the authors. The experimental results showed that the system is able to infer quite accurate concept hierarchies and associations among the concepts. It is therefore well suited for classification-based reasoning.


Learning Different Concept Hierarchies and the Relations Between them from Classified Data

Intelligent Data Analysis for Real-Life Applications ◽

10.4018/978-1-4666-1806-0.ch002 ◽

2012 ◽

pp. 18-34 ◽

Cited By ~ 1

Author(s):

Fernando Benites ◽

Elena Sapozhnikova

Keyword(s):

Data Mining ◽

Real World ◽

Rule Mining ◽

Mining System ◽

Interestingness Measures ◽

Class Hierarchies ◽

Data Mining System ◽

Real World Datasets ◽

Class Labels ◽

Concept Hierarchies

Methods for the automatic extraction of taxonomies and concept hierarchies from data have recently emerged as essential assistance for humans in ontology construction. The objective of this chapter is to show how the extraction of concept hierarchies and finding relations between them can be effectively coupled with a multi-label classification task. The authors introduce a data mining system which performs classification and addresses both issues by means of association rule mining. The proposed system has been tested on two real-world datasets with the class labels of each dataset coming from two different class hierarchies. Several experiments on hierarchy extraction and concept relation were conducted to evaluate the system, and three different interestingness measures were applied to select the most important relations between concepts. One of the measures was developed by the authors. The experimental results showed that the system is able to infer quite accurate concept hierarchies and associations among the concepts. It is therefore well suited for classification-based reasoning.


Insight into the Quality of Life of Patients with Ankylosing Spondylitis: Real-World Data from a US-Based Life Impact Survey

Rheumatology and Therapy ◽

10.1007/s40744-019-0160-8 ◽

2019 ◽

Vol 6 (3) ◽

pp. 353-367 ◽

Cited By ~ 4

Author(s):

James T. Rosenbaum ◽

Lisa Pisenti ◽

Yujin Park ◽

Richard A. Howard

Keyword(s):

Quality Of Life ◽

Ankylosing Spondylitis ◽

Real World ◽

Real World Data ◽

World Data ◽

Life Impact ◽

Insight Into


Biclustering and boolean matrix factorization in data streams

Proceedings of the VLDB Endowment ◽

10.14778/3401960.3401968 ◽

2020 ◽

Vol 13 (10) ◽

pp. 1709-1722

Author(s):

Stefan Neumann ◽

Pauli Miettinen

Keyword(s):

Real World ◽

Data Streams ◽

Matrix Factorization ◽

Sufficient Conditions ◽

Bipartite Graphs ◽

Boolean Matrix ◽

Real World Data ◽

Factorization Problem ◽

Real World Datasets ◽

Baseline Algorithm

We study clustering of bipartite graphs and Boolean matrix factorization in data streams. We consider a streaming setting in which the vertices from the left side of the graph arrive one by one together with all of their incident edges. We provide an algorithm which, after one pass over the stream, recovers the set of clusters on the right side of the graph using sublinear space; to the best of our knowledge, this is the first algorithm with this property. We also show that after a second pass over the stream the left clusters of the bipartite graph can be recovered, and we show how to extend our algorithm to solve the Boolean matrix factorization problem (by exploiting the correspondence of Boolean matrices and bipartite graphs). We evaluate an implementation of the algorithm on synthetic data and on real-world data. On real-world datasets the algorithm is orders of magnitude faster than a static baseline algorithm while providing quality results within a factor of 2 of the baseline algorithm. Our algorithm scales linearly in the number of edges in the graph. Finally, we analyze the algorithm theoretically and provide sufficient conditions under which the algorithm recovers a set of planted clusters under a standard random graph model.
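The correspondence between Boolean matrices and bipartite graphs mentioned in this abstract can be sketched in a few lines of Python. The example below only shows the Boolean matrix product and a reconstruction error count. It is not the streaming algorithm from the paper, and the matrices and function names are hypothetical.

```python
from typing import List

BoolMatrix = List[List[int]]


def boolean_product(left: BoolMatrix, right: BoolMatrix) -> BoolMatrix:
    """Boolean matrix product: entry (i, j) is the OR over k of
    left[i][k] AND right[k][j]."""
    inner, cols = len(right), len(right[0])
    return [[int(any(row[k] and right[k][j] for k in range(inner)))
             for j in range(cols)]
            for row in left]


def reconstruction_error(original: BoolMatrix, approx: BoolMatrix) -> int:
    """Number of entries on which the two matrices disagree."""
    return sum(a != b
               for row_o, row_a in zip(original, approx)
               for a, b in zip(row_o, row_a))


# Hypothetical biadjacency matrix: rows are left vertices, columns are
# right vertices, and a 1 marks an edge.
adjacency = [[1, 1, 0],
             [1, 1, 1],
             [0, 1, 1]]

# Column k of `left_factor` lists the left vertices of cluster k, and row k
# of `right_factor` lists its right vertices.
left_factor = [[1, 0],
               [1, 1],
               [0, 1]]
right_factor = [[1, 1, 0],
                [0, 1, 1]]

reconstructed = boolean_product(left_factor, right_factor)
error = reconstruction_error(adjacency, reconstructed)
```

Each pair of a left-factor column and the matching right-factor row describes one bicluster, so a factorization with few factors and low error doubles as a clustering of both sides of the graph.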
