Incorporating side information into Robust Matrix Factorization with Quantile Random Forest under Bayesian framework (preprint)

Mapping Intimacies ◽

10.31226/osf.io/b8jke ◽

2019 ◽

Author(s):

Andrey Babkin

Keyword(s):

Matrix Factorization ◽

Side Information ◽

Model Performance ◽

Optimization Procedure ◽

Real Data ◽

Bayesian Framework ◽

Data Sets ◽

Additional Information ◽

Pure Matrix ◽

Surrogate Function

Matrix Factorization is a widely used technique for modeling pairwise and matrix-like data. It is frequently used in pattern recognition, topic analysis and other areas. Side information is often available; however, utilizing this additional information is problematic in the pure matrix factorization framework. This article proposes a novel method of utilizing side information by combining an arbitrary nonlinear Quantile Regression model and Matrix Factorization under a Bayesian framework. A gradient-free optimization procedure with a novel Surrogate Function is used to compute the resulting MAP estimator. The model performance has been evaluated on real data sets.
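
Quantile regression models are commonly fit by minimizing the pinball (quantile) loss. The sketch below shows that standard loss only; whether the preprint uses exactly this form is an assumption, and the function name `pinball_loss` is hypothetical.

```python
def pinball_loss(y_true, y_pred, tau):
    """Average pinball (quantile) loss at quantile level tau in (0, 1).

    Assumption: this is the textbook loss, not necessarily the one
    used in the preprint summarized above.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction (diff >= 0) is weighted by tau,
        # over-prediction (diff < 0) by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)
```

With `tau = 0.9`, under-predicting by one unit costs 0.9 while over-predicting by one unit costs 0.1, which is what pulls the fitted value toward the upper quantile.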


The Automatic Non-Negative Matrix Factorization of the Hierarchy Clustering Method

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.325-326.1489 ◽

2013 ◽

Vol 325-326 ◽

pp. 1489-1492

Author(s):

Tie Qi Li ◽

Wen Shuo Zhang

Keyword(s):

Matrix Factorization ◽

Vector Space Model ◽

Real Data ◽

Data Sets ◽

Text Data ◽

Space Model ◽

Hierarchical Relations ◽

Weight Calculation ◽

Novel Method ◽

Non Negative Matrix Factorization

With such a huge volume of information available, finding the useful information has become a problem. To deal with hierarchical relations in text data, a novel method, called automatic non-negative matrix factorization of the hierarchy clustering, is proposed for text mining. We use the vector space model as the research foundation and mainly discuss two problems: feature selection and weight calculation. The experimental results on real data sets demonstrate that our method outperforms, on average, all 6 other methods.


GRAPH BASED CLUSTERING WITH CONSTRAINTS AND ACTIVE LEARNING

Journal of Computer Science and Cybernetics ◽

10.15625/1813-9663/37/1/15773 ◽

2021 ◽

Vol 37 (1) ◽

pp. 71-89

Author(s):

Vu-Tuan Dang ◽

Viet-Vu Vu ◽

Hong-Quan Do ◽

Thi Kieu Oanh Le

Keyword(s):

Active Learning ◽

Clustering Algorithm ◽

Side Information ◽

Clustering Algorithms ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Supervised Clustering ◽

Class Labels ◽

Graph Based Clustering

During the past few years, semi-supervised clustering has emerged as a new and interesting direction in machine learning research. In a semi-supervised clustering algorithm, the clustering results can be significantly improved by using side information, which is available or collected from users. There are two main kinds of side information that can be learned in semi-supervised clustering algorithms: class labels (called seeds) and pairwise constraints. The first semi-supervised clustering algorithm was introduced in 2000, and since then many algorithms have been presented in the literature. However, it is not easy to use both types of side information in the same algorithm. To address this problem, this paper proposes a semi-supervised graph-based clustering algorithm, called MCSSGC, that uses both seeds and constraints in the clustering process. Moreover, we introduce a simple but efficient active learning method, named KMMFFQS, to collect the constraints that can boost the performance of MCSSGC. To verify the effectiveness of the proposed algorithm, we conducted a series of experiments not only on real data sets from UCI, but also on a document data set used for information extraction from Vietnamese documents. The results show that the proposed algorithm can significantly improve the clustering process compared to some recent algorithms.


SEMI-SUPERVISED FUZZY CLUSTERING WITH LEARNABLE CLUSTER DEPENDENT KERNELS

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213013500139 ◽

2013 ◽

Vol 22 (03) ◽

pp. 1350013 ◽

Cited By ~ 2

Author(s):

OUIEM BCHIR ◽

HICHEM FRIGUI ◽

MOHAMED MAHER BEN ISMAIL

Keyword(s):

Fuzzy Clustering ◽

Side Information ◽

Metric Learning ◽

Real Data ◽

Distance Functions ◽

Gaussian Kernel ◽

Cost Functions ◽

Data Sets ◽

Learning Approaches ◽

Data Set

Many machine learning applications rely on learning distance functions with side information. Most of these distance metric learning approaches learn a Mahalanobis distance. While these approaches may work well when the data is low-dimensional, they become computationally expensive or even infeasible for high-dimensional data. In this paper, we propose a novel method of learning nonlinear distance functions with side information while clustering the data. The new semi-supervised clustering approach is called Semi-Supervised Fuzzy clustering with Learnable Cluster dependent Kernels (SS-FLeCK). The proposed algorithm learns the underlying cluster-dependent dissimilarity measure while finding compact clusters in the given data set. The learned dissimilarity is based on a Gaussian kernel function with cluster-dependent parameters. The objective function integrates penalty and reward cost functions. These cost functions are weighted by fuzzy membership degrees. Moreover, they use side information in the form of a small set of constraints on which instances should or should not reside in the same cluster. The proposed algorithm uses only the pairwise relation between the feature vectors. This makes it applicable when similar objects cannot be represented by a single prototype. Using synthetic and real data sets, we show that SS-FLeCK outperforms several other algorithms.


Graphs Regularized Robust Matrix Factorization and Its Application on Student Grade Prediction

Applied Sciences ◽

10.3390/app10051755 ◽

2020 ◽

Vol 10 (5) ◽

pp. 1755 ◽

Cited By ~ 1

Author(s):

Yupei Zhang ◽

Yue Yun ◽

Huan Dai ◽

Jiaqi Cui ◽

Xuequn Shang

Keyword(s):

Matrix Factorization ◽

Teaching And Learning ◽

Side Information ◽

Low Rank ◽

Data Sets ◽

Data Set ◽

Mm Algorithm ◽

Grade Prediction ◽

Public Data ◽

Learned Features

Student grade prediction (SGP) is an important educational problem for designing personalized strategies of teaching and learning. Many studies adopt the technique of matrix factorization (MF). However, their methods often focus on the grade records and disregard side information, such as backgrounds and relationships. To this end, in this paper, we propose a new MF method, called graph regularized robust matrix factorization (GRMF), based on a recent robust MF version. GRMF integrates two side graphs built on the side data of students and courses into the objective of robust low-rank MF. As a result, the learned features of students and courses can capture more priors from educational situations and achieve more accurate grade predictions. The resulting objective can be effectively optimized by the Majorization Minimization (MM) algorithm. In addition, GRMF not only yields features specific to the education domain but can also deal with missing, noisy, and corrupted data. To verify our method, we test GRMF on two public data sets for rating prediction and image recovery. Finally, we apply GRMF to educational data from our university, which is composed of 1325 students and 832 courses. The extensive experimental results show that GRMF is robust to various data problems and learns more effective features than other methods. Moreover, GRMF also delivers higher prediction accuracy than other methods on our educational data set. This technique can facilitate personalized teaching and learning in higher education.


Scalable Probabilistic Matrix Factorization with Graph-Based Priors

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6043 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5851-5858

Author(s):

Jonathan Strahl ◽

Jaakko Peltonen ◽

Hiroshi Mamitsuka ◽

Samuel Kaski

Keyword(s):

Matrix Factorization ◽

Prediction Accuracy ◽

Side Information ◽

Matrix Completion ◽

Real Data ◽

Data Matrix ◽

Laptop Computer ◽

Completion Problem ◽

Graphical Lasso ◽

The Matrix

In matrix factorization, available graph side-information may not be well suited for the matrix completion problem, having edges that disagree with the latent-feature relations learnt from the incomplete data matrix. We show that removing these contested edges improves prediction accuracy and scalability. We identify the contested edges through a highly efficient graphical lasso approximation. The identification and removal of contested edges adds no computational complexity to state-of-the-art graph-regularized matrix factorization, remaining linear with respect to the number of non-zeros. Computational load even decreases in proportion to the number of edges removed. Formulating a probabilistic generative model and using expectation maximization to extend graph-regularised alternating least squares (GRALS) guarantees convergence. Rich simulated experiments illustrate the desired properties of the resulting algorithm. In real-data experiments we demonstrate improved prediction accuracy with fewer graph edges (empirical evidence that graph side-information is often inaccurate). A 300 thousand dimensional graph with three million edges (Yahoo music side-information) can be analyzed in under ten minutes on a standard laptop computer, demonstrating the efficiency of our graph update.


CHOOSING SEEDS FOR SEMI-SUPERVISED GRAPH BASED CLUSTERING

Journal of Computer Science and Cybernetics ◽

10.15625/1813-9663/35/4/14123 ◽

2019 ◽

Vol 35 (4) ◽

pp. 373-384

Author(s):

Cuong Le ◽

Viet Vu Vu ◽

Le Thi Kieu Oanh ◽

Nguyen Thi Hai Yen

Keyword(s):

Learning Algorithm ◽

Side Information ◽

Clustering Algorithms ◽

Real Data ◽

Data Sets ◽

Data Set ◽

Supervised Clustering ◽

Efficient Data ◽

Graph Based Clustering

Though clustering algorithms have a long history, clustering still attracts a lot of attention because of the need for efficient data analysis tools in many applications such as social networks, electronic commerce, GIS, etc. Recently, semi-supervised clustering, which uses side information, has received a great deal of attention; examples include semi-supervised K-Means, semi-supervised DBSCAN, and semi-supervised graph-based clustering (SSGC). Generally, there are two forms of side information: seed form (labeled data) and constraint form (must-link, cannot-link). By integrating information provided by the user or domain expert, semi-supervised clustering can produce the expected results. In fact, clustering results usually depend on the side information provided, so different side information will produce different clustering results. In some cases, the performance of clustering may decrease if the side information is not carefully chosen. This paper addresses the problem of efficiently collecting seeds for semi-supervised clustering, especially for graph-based clustering by seeding (SSGC). Properly collected seeds can boost the quality of clustering and minimize the number of queries solicited from the user. For this purpose, we have developed an active learning algorithm (called SKMMM) for the seed collection task, which identifies candidates to query users about by using the K-Means and min-max algorithms. Experiments conducted on real data sets from UCI and a real collected document data set show the effectiveness of our approach compared with other methods.
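
The constraint form of side information can be made concrete with a small check. The sketch below counts how many must-link and cannot-link constraints a given cluster assignment violates; it is an illustration only, not the SKMMM or SSGC algorithm, and the function name `count_constraint_violations` is hypothetical.

```python
def count_constraint_violations(labels, must_link, cannot_link):
    """Count pairwise constraints violated by a cluster assignment.

    labels      -- labels[i] is the cluster id of point i
    must_link   -- pairs (i, j) that should share a cluster
    cannot_link -- pairs (i, j) that should be in different clusters
    """
    violated = 0
    for i, j in must_link:
        if labels[i] != labels[j]:  # pair was split apart
            violated += 1
    for i, j in cannot_link:
        if labels[i] == labels[j]:  # pair was merged together
            violated += 1
    return violated
```

A count of this kind is one simple way to compare how well different clusterings respect the side information supplied by a user.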


TrustSVD: A Novel Trust-Based Matrix Factorization Model with User Trust and Item Ratings

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i11.422 ◽

2017 ◽

Vol 7 (11) ◽

pp. 7 ◽

Cited By ~ 1

Author(s):

K Sobha Rani

Keyword(s):

Matrix Factorization ◽

Social Trust ◽

State Of The Art ◽

Data Sets ◽

Real World Data ◽

Recommendation Algorithm ◽

Active User ◽

Factorization Model ◽

The Social ◽

Matrix Factorization Technique

Collaborative filtering suffers from the problems of data sparsity and cold start, which dramatically degrade recommendation performance. To help resolve these issues, we propose TrustSVD, a trust-based matrix factorization technique. By analyzing the social trust data from four real-world data sets, we conclude that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. Hence, we build on top of a state-of-the-art recommendation algorithm, SVD++, which inherently involves the explicit and implicit influence of rated items, by further incorporating both the explicit and implicit influence of trusted users on the prediction of items for an active user. To our knowledge, the work reported is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that our approach, TrustSVD, achieves better accuracy than ten other counterparts and can better handle the issues concerned.
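
For orientation, the SVD++ prediction rule that TrustSVD builds on can be sketched as follows. This shows only the baseline SVD++ score (global mean, biases, and a user vector augmented by implicit item factors); the trust terms that TrustSVD adds are not shown, and all names here are hypothetical.

```python
import math


def svdpp_predict(mu, b_u, b_i, p_u, q_i, implicit_factors):
    """Baseline SVD++ rating prediction for one (user, item) pair.

    mu               -- global mean rating
    b_u, b_i         -- user and item biases
    p_u, q_i         -- explicit user and item latent vectors
    implicit_factors -- one vector y_j per item the user has rated
    """
    combined = list(p_u)
    if implicit_factors:
        # Implicit feedback is scaled by 1 / sqrt(number of rated items).
        norm = 1.0 / math.sqrt(len(implicit_factors))
        for y_j in implicit_factors:
            for k, value in enumerate(y_j):
                combined[k] += norm * value
    return mu + b_u + b_i + sum(q * c for q, c in zip(q_i, combined))
```

With no rated items the function falls back to the plain biased matrix factorization score `mu + b_u + b_i + q_i · p_u`.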


Transforming variables to central normality

Machine Learning ◽

10.1007/s10994-021-05960-5 ◽

2021 ◽

Author(s):

Jakob Raymaekers ◽

Peter J. Rousseeuw

Keyword(s):

Maximum Likelihood ◽

Maximum Likelihood Estimator ◽

Simulation Study ◽

Real Data ◽

Data Sets ◽

Transformation Parameter ◽

Likelihood Estimator ◽

Extensive Simulation ◽

Highly Sensitive

Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box–Cox and Yeo–Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center and a few outliers may deviate from it. It compares favorably to existing techniques in an extensive simulation study and on real data.
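
As a reference point, the classical Box–Cox and Yeo–Johnson transformations can be written out as below. This is a minimal sketch of the standard definitions only; the robust modification and the robust estimator proposed in the paper are not shown, and the function names are hypothetical.

```python
import math


def box_cox(x, lam):
    """Classical Box-Cox transform; defined for x > 0 only."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def yeo_johnson(x, lam):
    """Classical Yeo-Johnson transform; defined for all real x."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((x + 1.0) ** lam - 1.0) / lam
    # Negative inputs use the mirrored form with exponent (2 - lam).
    if lam == 2:
        return -math.log1p(-x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)
```

Setting `lam = 1` makes Yeo–Johnson the identity, which is a convenient sanity check when experimenting with other parameter values.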


A New Extension of Thinning-Based Integer-Valued Autoregressive Models for Count Data

Entropy ◽

10.3390/e23010062 ◽

2020 ◽

Vol 23 (1) ◽

pp. 62

Author(s):

Zhengwei Liu ◽

Fukang Zhu

Keyword(s):

Likelihood Estimation ◽

Real Data ◽

Autoregressive Models ◽

Superior Performance ◽

Data Sets ◽

Binomial Thinning ◽

Free Case ◽

Two Parameters ◽

Conditional Maximum ◽

Thinning Operator

Thinning operators play an important role in the analysis of integer-valued autoregressive models, and the most widely used is binomial thinning. Inspired by the theory of extended Pascal triangles, a new thinning operator named extended binomial is introduced, which generalizes binomial thinning. Compared to the binomial thinning operator, the extended binomial thinning operator has two parameters and is more flexible in modeling. Based on the proposed operator, a new integer-valued autoregressive model is introduced, which can accurately and flexibly capture the dispersion features of count time series. Two-step conditional least squares (CLS) estimation is investigated for the innovation-free case, and conditional maximum likelihood estimation is also discussed. We have also obtained the asymptotic property of the two-step CLS estimator. Finally, three overdispersed or underdispersed real data sets are considered to illustrate the superior performance of the proposed model.
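
To illustrate the operator the paper generalizes, the sketch below implements ordinary binomial thinning and uses it to simulate a classical INAR(1) series. The Poisson innovations, the zero starting value, and all function names are assumptions made for the example; the extended binomial operator from the paper is not shown.

```python
import math
import random


def binomial_thinning(alpha, x, rng):
    """alpha o x: successes among x independent Bernoulli(alpha) trials."""
    return sum(1 for _ in range(x) if rng.random() < alpha)


def poisson_sample(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's method (small lam only)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_inar1(alpha, lam, n, seed=0):
    """Simulate X_t = alpha o X_{t-1} + e_t with Poisson(lam) innovations."""
    rng = random.Random(seed)
    series = [0]  # assumed starting value
    for _ in range(n - 1):
        survivors = binomial_thinning(alpha, series[-1], rng)
        series.append(survivors + poisson_sample(lam, rng))
    return series
```

Calling `simulate_inar1(0.5, 2.0, 200)` yields a non-negative integer series whose dependence on the previous count is controlled by `alpha`.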


Goodness-of-Fit Tests for Bivariate Time Series of Counts

Econometrics ◽

10.3390/econometrics9010010 ◽

2021 ◽

Vol 9 (1) ◽

pp. 10

Author(s):

Šárka Hudecová ◽

Marie Hušková ◽

Simos G. Meintanis

Keyword(s):

Goodness Of Fit ◽

Probability Generating Function ◽

Parametric Bootstrap ◽

Real Data ◽

Data Sets ◽

Test Statistics ◽

Finite Sample ◽

Generalized Poisson ◽

Goodness Of Fit Tests ◽

Monte Carlo Experiments

This article considers goodness-of-fit tests for bivariate INAR and bivariate Poisson autoregression models. The test statistics are based on an L2-type distance between two estimators of the probability generating function of the observations: one entirely nonparametric and the other semiparametric, computed under the corresponding null hypothesis. The asymptotic distribution of the proposed test statistics, both under the null hypotheses and under alternatives, is derived and consistency is proved. The case of testing bivariate generalized Poisson autoregression and the extension of the methods to dimensions higher than two are also discussed. The finite-sample performance of a parametric bootstrap version of the tests is illustrated via a series of Monte Carlo experiments. The article concludes with applications to real data sets and a discussion.
