Mining Feature Relationships in Data

2021 ◽  
Author(s):  
Andrew Lensen

When faced with a new dataset, most practitioners begin by performing exploratory data analysis to discover interesting patterns and characteristics within data. Techniques such as association rule mining are commonly applied to uncover relationships between features (attributes) of the data. However, association rules are primarily designed for use on binary or categorical data, due to their use of rule-based machine learning. A large proportion of real-world data is continuous in nature, and discretisation of such data leads to inaccurate and less informative association rules. In this paper, we propose an alternative approach called feature relationship mining (FRM), which uses a genetic programming approach to automatically discover symbolic relationships between continuous or categorical features in data. To the best of our knowledge, our proposed approach is the first such symbolic approach with the goal of explicitly discovering relationships between features. Empirical testing on a variety of real-world datasets shows the proposed method is able to find high-quality, simple feature relationships which can be easily interpreted and which provide clear and non-trivial insight into data.


2021 ◽  
Vol 4 ◽  
Author(s):  
Bradley Butcher ◽  
Vincent S. Huang ◽  
Christopher Robinson ◽  
Jeremy Reffin ◽  
Sema K. Sgaier ◽  
...  

Developing data-driven solutions that address real-world problems requires understanding of these problems’ causes and how their interaction affects the outcome, often with only observational data. Causal Bayesian Networks (BN) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate the idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made dependent on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.


Author(s):  
Goksu Tuysuzoglu ◽  
Derya Birant

Through the use of Internet of Things-based sensors in air quality monitoring stations, the concentrations of different pollutants and meteorological parameters can be measured regularly. Under unusual conditions (e.g., increased levels of dangerous pollutants), a smart assessment system can produce a warning so that an appropriate air quality management process can be initiated. In this context, the objective of this study is to discover relationships and patterns among air pollution features and characteristics. Frequently observed association rules can then trigger an appropriate background smart environment system when a critical situation is detected. In the experimental studies of the current project, traditional association rule mining and weighted association rule mining methods were employed using real-world datasets collected from 21 monitoring stations in Turkey. As a result, useful association rules exceeding the user-defined support and confidence levels were obtained, which can form a basis for further research.
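The support and confidence thresholds mentioned in this abstract are the standard association rule measures. The following is a minimal sketch of how they are computed, not the study's implementation: the item names (`PM10_high`, `SO2_high`, `wind_low`) and the four toy transactions are hypothetical, and the weighted variant used in the study is not shown.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    support_a = support(transactions, a)
    if support_a == 0:
        return 0.0  # rule never fires, so report zero confidence
    return support(transactions, both) / support_a


# Hypothetical discretised sensor readings, one set per observation.
transactions = [
    frozenset({"PM10_high", "SO2_high", "wind_low"}),
    frozenset({"PM10_high", "SO2_high"}),
    frozenset({"PM10_high", "wind_low"}),
    frozenset({"SO2_high"}),
]

rule_support = support(transactions, {"PM10_high", "SO2_high"})
rule_confidence = confidence(transactions, {"PM10_high"}, {"SO2_high"})
```

A rule such as `PM10_high -> SO2_high` would be kept only if both values exceed the user-defined minimums.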


Author(s):  
Shuji Hao ◽  
Peilin Zhao ◽  
Yong Liu ◽  
Steven C. H. Hoi ◽  
Chunyan Miao

Relative similarity learning (RSL) aims to learn similarity functions from data with relative constraints. Most previous algorithms developed for RSL are batch-based learning approaches, which scale poorly when real-world data arrive sequentially. These methods are also often designed to learn a single similarity function for a specific task, so they may be sub-optimal for multi-task learning problems. To overcome these limitations, we propose a scalable RSL framework named OMTRSL (Online Multi-Task Relative Similarity Learning). Specifically, we first develop a simple yet effective online learning algorithm for multi-task relative similarity learning. We then propose an active learning algorithm to reduce the labeling cost. The proposed algorithms not only enjoy theoretical guarantees, but also show high efficacy and efficiency in extensive experiments on real-world datasets.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Rosy Tsopra ◽  
Xose Fernandez ◽  
Claudio Luchinat ◽  
Lilia Alberghina ◽  
Hans Lehrach ◽  
...  

Background: Artificial intelligence (AI) has the potential to transform our healthcare systems significantly. New AI technologies based on machine learning approaches should play a key role in clinical decision-making in the future. However, their implementation in health care settings remains limited, mostly due to a lack of robust validation procedures. There is a need to develop reliable assessment frameworks for the clinical validation of AI. We present here an approach for assessing AI for predicting treatment response in triple-negative breast cancer (TNBC), using real-world data and molecular -omics data from clinical data warehouses and biobanks. Methods: The European “ITFoC (Information Technology for the Future Of Cancer)” consortium designed a framework for the clinical validation of AI technologies for predicting treatment response in oncology. Results: This framework is based on seven key steps specifying: (1) the intended use of AI, (2) the target population, (3) the timing of AI evaluation, (4) the datasets used for evaluation, (5) the procedures used for ensuring data safety (including data quality, privacy and security), (6) the metrics used for measuring performance, and (7) the procedures used to ensure that the AI is explainable. This framework forms the basis of a validation platform that we are building for the “ITFoC Challenge”. This community-wide competition will make it possible to assess and compare AI algorithms for predicting the response to TNBC treatments with external real-world datasets. Conclusions: The predictive performance and safety of AI technologies must be assessed in a robust, unbiased and transparent manner before their implementation in healthcare settings. We believe that the consideration of the ITFoC consortium will contribute to the safe transfer and implementation of AI in clinical settings, in the context of precision oncology and personalized care.


Data Mining ◽  
2013 ◽  
pp. 125-141
Author(s):  
Fernando Benites ◽  
Elena Sapozhnikova

Methods for the automatic extraction of taxonomies and concept hierarchies from data have recently emerged as essential assistance for humans in ontology construction. The objective of this chapter is to show how the extraction of concept hierarchies and the finding of relations between them can be effectively coupled with a multi-label classification task. The authors introduce a data mining system which performs classification and addresses both issues by means of association rule mining. The proposed system has been tested on two real-world datasets, with the class labels of each dataset coming from two different class hierarchies. Several experiments on hierarchy extraction and concept relation were conducted to evaluate the system, and three different interestingness measures were applied to select the most important relations between concepts. One of the measures was developed by the authors. The experimental results showed that the system is able to infer quite accurate concept hierarchies and associations among the concepts. It is therefore well suited for classification-based reasoning.


2020 ◽  
Vol 13 (10) ◽  
pp. 1709-1722
Author(s):  
Stefan Neumann ◽  
Pauli Miettinen

We study clustering of bipartite graphs and Boolean matrix factorization in data streams. We consider a streaming setting in which the vertices from the left side of the graph arrive one by one together with all of their incident edges. We provide an algorithm which, after one pass over the stream, recovers the set of clusters on the right side of the graph using sublinear space; to the best of our knowledge, this is the first algorithm with this property. We also show that after a second pass over the stream the left clusters of the bipartite graph can be recovered, and we show how to extend our algorithm to solve the Boolean matrix factorization problem (by exploiting the correspondence of Boolean matrices and bipartite graphs). We evaluate an implementation of the algorithm on synthetic data and on real-world data. On real-world datasets the algorithm is orders of magnitude faster than a static baseline algorithm while providing quality results within a factor of 2 of the baseline algorithm. Our algorithm scales linearly in the number of edges in the graph. Finally, we analyze the algorithm theoretically and provide sufficient conditions under which the algorithm recovers a set of planted clusters under a standard random graph model.
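The correspondence between Boolean matrices and bipartite graphs mentioned in this abstract can be illustrated with a small sketch. It is not the authors' streaming algorithm: it only shows, on a hypothetical 3-by-3 example, that a bipartite graph's biadjacency matrix can equal the Boolean product of a left-cluster matrix `B` and a right-cluster matrix `C`. All names and values are assumptions chosen for illustration.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def boolean_product(B: Matrix, C: Matrix) -> Matrix:
    """Boolean product: entry (i, j) is OR over r of (B[i][r] AND C[r][j])."""
    n, k, m = len(B), len(C), len(C[0])
    return [
        [int(any(B[i][r] and C[r][j] for r in range(k))) for j in range(m)]
        for i in range(n)
    ]


def edges_to_matrix(edges: List[Tuple[int, int]], n_left: int, n_right: int) -> Matrix:
    """Biadjacency matrix: entry (u, v) is 1 iff left vertex u links to right vertex v."""
    A = [[0] * n_right for _ in range(n_left)]
    for u, v in edges:
        A[u][v] = 1
    return A


# Hypothetical example with two clusters.
# Column r of B marks the left vertices in cluster r;
# row r of C marks the right vertices in cluster r.
B = [[1, 0], [1, 0], [0, 1]]
C = [[1, 1, 0], [0, 0, 1]]

# The same structure written as bipartite edges (left vertex, right vertex).
edges = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)]

A = edges_to_matrix(edges, n_left=3, n_right=3)
reconstruction = boolean_product(B, C)
```

Here the factorization is exact; on noisy data a factorization would only approximate the biadjacency matrix.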

