Biclustering and Boolean matrix factorization in data streams

We study clustering of bipartite graphs and Boolean matrix factorization in data streams. We consider a streaming setting in which the vertices from the left side of the graph arrive one by one together with all of their incident edges. We provide an algorithm which after one pass over the stream recovers the set of clusters on the right side of the graph using sublinear space; to the best of our knowledge this is the first algorithm with this property. We also show that after a second pass over the stream the left clusters of the bipartite graph can be recovered, and we show how to extend our algorithm to solve the Boolean matrix factorization problem (by exploiting the correspondence between Boolean matrices and bipartite graphs). We evaluate an implementation of the algorithm on synthetic data and on real-world data. On real-world datasets the algorithm is orders of magnitude faster than a static baseline algorithm while providing quality results within a factor of 2 of the baseline algorithm. Our algorithm scales linearly in the number of edges in the graph. Finally, we analyze the algorithm theoretically and provide sufficient conditions under which the algorithm recovers a set of planted clusters under a standard random graph model.
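
The correspondence between Boolean matrices and bipartite graphs mentioned above can be illustrated with a minimal sketch. It is not the streaming algorithm from the abstract; it only shows the Boolean product that a factorization tries to match. The matrix `A`, the factors `U` and `V`, and both function names are hypothetical, chosen for illustration: rows of `A` are left vertices, columns are right vertices, `U` assigns left vertices to clusters, and `V` assigns clusters to right vertices.

```python
from typing import List

Matrix = List[List[int]]


def boolean_product(U: Matrix, V: Matrix) -> Matrix:
    """Boolean product: entry (i, j) is the OR over t of (U[i][t] AND V[t][j])."""
    k = len(V)
    n_cols = len(V[0])
    return [
        [int(any(row[t] and V[t][j] for t in range(k))) for j in range(n_cols)]
        for row in U
    ]


def reconstruction_error(A: Matrix, U: Matrix, V: Matrix) -> int:
    """Number of entries where the Boolean product of U and V differs from A."""
    B = boolean_product(U, V)
    return sum(a != b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


# Biadjacency matrix of a small bipartite graph (hypothetical example data).
A = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
]
# Left vertex 3 belongs to both clusters, which the OR in the product allows.
U = [[1, 0], [1, 0], [0, 1], [1, 1]]
V = [[1, 1, 0, 0], [0, 0, 1, 1]]

error = reconstruction_error(A, U, V)
```

Because the product uses OR rather than addition, overlapping clusters do not produce entries larger than 1, which is what distinguishes Boolean factorization from ordinary integer matrix factorization.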

Model order selection for approximate Boolean matrix factorization problem

Knowledge-Based Systems ◽

10.1016/j.knosys.2021.107184 ◽

2021 ◽

pp. 107184

Author(s):

Martin Trnecka ◽

Marketa Trneckova

Keyword(s):

Matrix Factorization ◽

Boolean Matrix ◽

Order Selection ◽

Model Order Selection ◽

Model Order ◽

Factorization Problem ◽

Selection For

Concept Drift Adaptation Techniques in Distributed Environment for Real-World Data Streams

Smart Cities ◽

10.3390/smartcities4010021 ◽

2021 ◽

Vol 4 (1) ◽

pp. 349-371

Author(s):

Hassan Mehmood ◽

Panos Kostakos ◽

Marta Cortes ◽

Theodoros Anagnostopoulos ◽

Susanna Pirttikangas ◽

...

Keyword(s):

Real World ◽

Data Streams ◽

Smart City ◽

Smart Cities ◽

Concept Drift ◽

Distributed Environment ◽

Real World Data ◽

Unique Challenge ◽

World Data ◽

Concept Drift Detection

Real-world data streams pose a unique challenge to the implementation of machine learning (ML) models and data analysis. A notable problem that has been introduced by the growth of Internet of Things (IoT) deployments across the smart city ecosystem is that the statistical properties of data streams can change over time, resulting in poor prediction performance and ineffective decisions. While concept drift detection methods aim to patch this problem, emerging communication and sensing technologies are generating a massive amount of data, requiring distributed environments to perform computation tasks across smart city administrative domains. In this article, we implement and test a number of state-of-the-art active concept drift detection algorithms for time series analysis within a distributed environment. We use real-world data streams and provide critical analysis of results retrieved. The challenges of implementing concept drift adaptation algorithms, along with their applications in smart cities, are also discussed.
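
The abstract does not name the drift detectors that were tested. As one illustration of what an active detector does, the following is a minimal sketch of the Page-Hinkley test, a commonly used change detector for streams. The function name, the default values of `delta` and `threshold`, and the synthetic stream are assumptions made for this example and are not taken from the article.

```python
from typing import Iterable, Optional


def page_hinkley(
    stream: Iterable[float], delta: float = 0.005, threshold: float = 50.0
) -> Optional[int]:
    """Return the 0-based index at which an upward mean shift is flagged, or None.

    delta is the tolerated deviation per sample; threshold is the alarm level.
    """
    mean = 0.0        # running mean of all samples seen so far
    cumulative = 0.0  # cumulative deviation from the running mean
    minimum = 0.0     # smallest cumulative deviation seen so far
    for n, x in enumerate(stream, start=1):
        mean += (x - mean) / n
        cumulative += x - mean - delta
        minimum = min(minimum, cumulative)
        if cumulative - minimum > threshold:
            return n - 1
    return None


# Synthetic stream (hypothetical data): the mean jumps from 0 to 5 at index 200.
stream = [0.0] * 200 + [5.0] * 200
drift_at = page_hinkley(stream)
stationary = page_hinkley([1.0] * 400)
```

A larger `threshold` reduces false alarms at the cost of a longer detection delay, which is the usual trade-off when tuning such detectors.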

An Accelerated Symmetric Nonnegative Matrix Factorization Algorithm Using Extrapolation

Symmetry ◽

10.3390/sym12071187 ◽

2020 ◽

Vol 12 (7) ◽

pp. 1187

Author(s):

Peitao Wang ◽

Zhaoshui He ◽

Jun Lu ◽

Beihai Tan ◽

YuLei Bai ◽

...

Keyword(s):

Real World ◽

Matrix Factorization ◽

Nonnegative Matrix Factorization ◽

Nonnegative Matrix ◽

Low Rank ◽

Tensor Factorization ◽

Real World Data ◽

Restart Strategy ◽

Extrapolation Scheme ◽

Symmetric Nonnegative Matrix Factorization

Symmetric nonnegative matrix factorization (SNMF) approximates a symmetric nonnegative matrix by the product of a nonnegative low-rank matrix and its transpose. SNMF has been successfully used in many real-world applications such as clustering. In this paper, we propose an accelerated variant of the multiplicative update (MU) algorithm of He et al. designed to solve the SNMF problem. The accelerated algorithm is derived by using the extrapolation scheme of Nesterov and a restart strategy. The extrapolation scheme plays a leading role in accelerating the MU algorithm of He et al., and the restart strategy ensures that the objective function of SNMF is monotonically decreasing. We apply the accelerated algorithm to clustering problems and symmetric nonnegative tensor factorization (SNTF). The experimental results on both synthetic and real-world data show that it is more than four times faster than the MU algorithm of He et al. and performs favorably compared to recent state-of-the-art algorithms.
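
The combination of a multiplicative update, extrapolation, and restart described above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' algorithm: it swaps in a damped multiplicative update (damping factor 1/2) rather than the MU rule of He et al., uses a fixed extrapolation weight `beta` instead of a Nesterov schedule, and all function names and the test matrix are hypothetical.

```python
import numpy as np


def snmf_objective(A: np.ndarray, H: np.ndarray) -> float:
    """Squared Frobenius error of approximating A by H @ H.T."""
    return float(np.linalg.norm(A - H @ H.T, "fro") ** 2)


def damped_mu_step(A: np.ndarray, H: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """One damped multiplicative update; keeps H nonnegative when A and H are."""
    numer = A @ H
    denom = H @ (H.T @ H) + eps  # eps guards against division by zero
    return H * (0.5 + 0.5 * numer / denom)


def accelerated_snmf(A, rank, n_iter=200, beta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], rank))
    H_prev = H.copy()
    history = [snmf_objective(A, H)]
    for _ in range(n_iter):
        # Extrapolate along the previous direction, then project onto H >= 0.
        Y = np.maximum(H + beta * (H - H_prev), 0.0)
        candidate = damped_mu_step(A, Y)
        if snmf_objective(A, candidate) > history[-1]:
            # Restart: discard the extrapolation and step from H itself.
            candidate = damped_mu_step(A, H)
            if snmf_objective(A, candidate) > history[-1]:
                candidate = H  # keep the current iterate so the objective never rises
        H_prev, H = H, candidate
        history.append(snmf_objective(A, H))
    return H, history


# Symmetric nonnegative test matrix with two diagonal blocks (hypothetical data).
A = np.kron(np.eye(2), np.ones((3, 3)))
H, history = accelerated_snmf(A, rank=2)
```

The restart check is what makes the recorded objective non-increasing in this sketch: an extrapolated step is accepted only when it does not raise the objective.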

Discrete Trust-aware Matrix Factorization for Fast Recommendation

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/191 ◽

2019 ◽

Author(s):

Guibing Guo ◽

Enneng Yang ◽

Li Shen ◽

Xiaochun Yang ◽

Xiaodong He

Keyword(s):

Social Influence ◽

Collaborative Filtering ◽

Recommender Systems ◽

Social Relations ◽

Real World ◽

Matrix Factorization ◽

State Of The Art ◽

Proposed Model ◽

Hamming Space ◽

Real World Datasets

Trust-aware recommender systems have received much attention recently for their ability to capture the influence among connected users. However, they suffer from efficiency issues due to the large amount of data and time-consuming real-valued operations. Although existing discrete collaborative filtering may alleviate this issue to some extent, it is unable to accommodate social influence. In this paper we propose a discrete trust-aware matrix factorization (DTMF) model that takes advantage of both social relations and discrete techniques for fast recommendation. Specifically, we map the latent representations of users and items into a joint Hamming space by recovering the rating and trust interactions between users and items. We adopt a sophisticated discrete coordinate descent (DCD) approach to optimize our proposed model. In addition, experiments on two real-world datasets demonstrate the superiority of our approach over other state-of-the-art approaches in terms of ranking accuracy and efficiency.
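
Why binary codes in a Hamming space avoid real-valued operations can be shown with a small sketch. It is not the DTMF model; it only illustrates the standard identity that, for codes with entries +1 or -1 and length r, the inner product equals r minus twice the Hamming distance. The helper names and the example codes are hypothetical.

```python
from typing import List


def pack(code: List[int]) -> int:
    """Pack a +1/-1 code into an integer bit mask (+1 sets the bit)."""
    mask = 0
    for position, value in enumerate(code):
        if value == 1:
            mask |= 1 << position
    return mask


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which the two packed codes differ."""
    return bin(a ^ b).count("1")


def inner_product_from_hamming(a: int, b: int, r: int) -> int:
    """Inner product of two length-r +1/-1 codes, computed from packed bits."""
    return r - 2 * hamming(a, b)


# Hypothetical 8-bit codes for one user and one item.
user = [1, -1, 1, 1, -1, -1, 1, -1]
item = [1, 1, 1, -1, -1, 1, 1, -1]
r = len(user)

direct = sum(u * v for u, v in zip(user, item))
fast = inner_product_from_hamming(pack(user), pack(item), r)
```

The packed form replaces r multiplications and additions with one XOR and a bit count, which is the source of the speedup that discrete methods aim for.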

Causal Datasheet for Datasets: An Evaluation Guide for Real-World Data Analysis and Data Collection Design Using Bayesian Networks

Frontiers in Artificial Intelligence ◽

10.3389/frai.2021.612551 ◽

2021 ◽

Vol 4 ◽

Author(s):

Bradley Butcher ◽

Vincent S. Huang ◽

Christopher Robinson ◽

Jeremy Reffin ◽

Sema K. Sgaier ◽

...

Keyword(s):

Global Health ◽

Bayesian Networks ◽

Sample Size ◽

Observational Data ◽

Real World ◽

Structure Learning ◽

Ground Truth ◽

Research Process ◽

Real World Data ◽

Real World Datasets

Developing data-driven solutions that address real-world problems requires understanding of these problems’ causes and how their interaction affects the outcome–often with only observational data. Causal Bayesian Networks (BN) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate an idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made dependent on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. 
First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.

SPMC: Socially-Aware Personalized Markov Chains for Sparse Sequential Recommendation

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/204 ◽

2017 ◽

Cited By ~ 13

Author(s):

Chenwei Cai ◽

Ruining He ◽

Julian McAuley

Keyword(s):

Markov Chains ◽

Social Relationships ◽

Real World ◽

Matrix Factorization ◽

State Of The Art ◽

Cold Start ◽

Additional Information ◽

New Methods ◽

Real World Datasets ◽

Sequential Information

Dealing with sparse, long-tailed datasets, and cold-start problems is always a challenge for recommender systems. These issues can partly be dealt with by making predictions not in isolation, but by leveraging information from related events; such information could include signals from social relationships or from the sequence of recent activities. Both types of additional information can be used to improve the performance of state-of-the-art matrix factorization-based techniques. In this paper, we propose new methods to combine both social and sequential information simultaneously, in order to further improve recommendation performance. We show these techniques to be particularly effective when dealing with sparsity and cold-start issues in several large, real-world datasets.

Online Multitask Relative Similarity Learning

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/253 ◽

2017 ◽

Cited By ~ 2

Author(s):

Shuji Hao ◽

Peilin Zhao ◽

Yong Liu ◽

Steven C. H. Hoi ◽

Chunyan Miao

Keyword(s):

Real World ◽

Learning Algorithm ◽

Learning Problems ◽

Similarity Function ◽

Learning Approaches ◽

Similarity Learning ◽

Real World Data ◽

Real World Datasets ◽

Online Learning Algorithm ◽

Relative Similarity

Relative similarity learning (RSL) aims to learn similarity functions from data with relative constraints. Most previous algorithms developed for RSL are batch-based learning approaches which suffer from poor scalability when dealing with real-world data arriving sequentially. These methods are often designed to learn a single similarity function for a specific task. Therefore, they may be sub-optimal for solving multi-task learning problems. To overcome these limitations, we propose a scalable RSL framework named OMTRSL (Online Multi-Task Relative Similarity Learning). Specifically, we first develop a simple yet effective online learning algorithm for multi-task relative similarity learning. Then, we also propose an active learning algorithm to save the labeling cost. The proposed algorithms not only enjoy theoretical guarantees, but also show high efficacy and efficiency in extensive experiments on real-world datasets.

Mining Feature Relationships in Data

10.26686/wgtn.14456337.v1 ◽

2021 ◽

Author(s):

Andrew Lensen

Keyword(s):

Association Rules ◽

Real World ◽

Programming Approach ◽

Rule Mining ◽

Real World Data ◽

Alternative Approach ◽

Exploratory Data ◽

Real World Datasets ◽

Symbolic Approach ◽

Insight Into

When faced with a new dataset, most practitioners begin by performing exploratory data analysis to discover interesting patterns and characteristics within data. Techniques such as association rule mining are commonly applied to uncover relationships between features (attributes) of the data. However, association rules are primarily designed for use on binary or categorical data, due to their use of rule-based machine learning. A large proportion of real-world data is continuous in nature, and discretisation of such data leads to inaccurate and less informative association rules. In this paper, we propose an alternative approach called feature relationship mining (FRM), which uses a genetic programming approach to automatically discover symbolic relationships between continuous or categorical features in data. To the best of our knowledge, our proposed approach is the first such symbolic approach with the goal of explicitly discovering relationships between features. Empirical testing on a variety of real-world datasets shows the proposed method is able to find high-quality, simple feature relationships which can be easily interpreted and which provide clear and non-trivial insight into data.

A framework for validating AI in precision medicine: considerations from the European ITFoC consortium

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01634-3 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Rosy Tsopra ◽

Xose Fernandez ◽

Claudio Luchinat ◽

Lilia Alberghina ◽

Hans Lehrach ◽

...

Keyword(s):

Treatment Response ◽

Real World ◽

Clinical Decision Making ◽

Precision Oncology ◽

Clinical Validation ◽

Learning Approaches ◽

Real World Data ◽

Privacy And Security ◽

The Future ◽

Real World Datasets

Background: Artificial intelligence (AI) has the potential to transform our healthcare systems significantly. New AI technologies based on machine learning approaches should play a key role in clinical decision-making in the future. However, their implementation in health care settings remains limited, mostly due to a lack of robust validation procedures. There is a need to develop reliable assessment frameworks for the clinical validation of AI. We present here an approach for assessing AI for predicting treatment response in triple-negative breast cancer (TNBC), using real-world data and molecular -omics data from clinical data warehouses and biobanks. Methods: The European “ITFoC (Information Technology for the Future Of Cancer)” consortium designed a framework for the clinical validation of AI technologies for predicting treatment response in oncology. Results: This framework is based on seven key steps specifying: (1) the intended use of AI, (2) the target population, (3) the timing of AI evaluation, (4) the datasets used for evaluation, (5) the procedures used for ensuring data safety (including data quality, privacy and security), (6) the metrics used for measuring performance, and (7) the procedures used to ensure that the AI is explainable. This framework forms the basis of a validation platform that we are building for the “ITFoC Challenge”. This community-wide competition will make it possible to assess and compare AI algorithms for predicting the response to TNBC treatments with external real-world datasets. Conclusions: The predictive performance and safety of AI technologies must be assessed in a robust, unbiased and transparent manner before their implementation in healthcare settings. We believe that the consideration of the ITFoC consortium will contribute to the safe transfer and implementation of AI in clinical settings, in the context of precision oncology and personalized care.

Matrix factorization completed multicontext data for tensor-enhanced recommendation

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-210641 ◽

2021 ◽

pp. 1-12

Author(s):

Shangju Deng ◽

Jiwei Qin

Keyword(s):

Recommender Systems ◽

Real World ◽

Matrix Factorization ◽

Cold Start ◽

User Preferences ◽

High Dimensional ◽

Satisfactory Performance ◽

Interactive Data ◽

Real World Datasets ◽

Item Data

Tensors have been explored to share latent user-item relations and have been shown to be effective for recommendation. Tensors suffer from sparsity and cold start problems in real recommendation scenarios; therefore, researchers and engineers usually use matrix factorization to address these issues and improve the performance of recommender systems. In this paper, we propose an algorithm, matrix factorization completed multicontext data for tensor-enhanced recommendation, that combines matrix factorization with a multicontext data method. To take advantage of existing user-item data, we add the contexts of time and trust to enrich the interactive data via matrix factorization. In addition, our approach is a high-dimensional tensor framework that further mines the latent relations from the user-item-trust-time tensor to improve recommendation performance. Through extensive experiments on real-world datasets, we demonstrate the superiority of our approach in predicting user preferences. This method is also shown to maintain satisfactory performance even if user-item interactions are sparse.
