User encoding for clustering in very sparse recommender systems tasks

Author(s):  
Pablo Pérez-Núnez ◽  
Jorge Díez ◽  
Oscar Luaces ◽  
Antonio Bahamonde

Abstract: Recommender systems are a very useful tool that lets companies and service providers focus on the preferences of their customers, helping them avoid an overwhelming variety of choices. In this context, clustering tools can play an important role in detecting groups of customers with similar tastes. Companies can thus run personalized marketing campaigns, offering their users new products that have been consumed by other users with comparable preferences. In this paper we present a general framework to cluster users with respect to their tastes when the records stored about the interactions between users and products are extremely scarce. Commonly, clustering methods employ the values of features describing the samples to be clustered (users in our case), but such features are not always available. We propose some alternative representations for users that capture their tastes to some extent, so that clustering algorithms can exploit them and form groups that are more homogeneous in this regard. To illustrate the performance of the whole framework, we tested it on six popular datasets commonly used as benchmarks for recommender systems, as well as on an extremely sparse real-world dataset that records the preferences of readers when clicking promoted links in digital publications. In the experimental section we compare our proposed representations to other common user encodings. We show that clustering users according only to their feature values or to the items they have evaluated yields the worst scores in terms of taste homogeneity.

10.12737/7483 ◽  
2014 ◽  
Vol 8 (7) ◽  
pp. 0-0
Author(s):  
Олег Сдвижков ◽  
Oleg Sdvizhkov

Cluster analysis [3] is a relatively new branch of mathematics that studies methods for partitioning a set of objects, described by a finite set of attributes, into homogeneous groups (clusters). Cluster analysis is widely used in psychology, sociology, economics (market segmentation), and many other areas that require classifying objects according to their characteristics. Clustering methods are implemented in the STATISTICA [1] and SPSS [2] packages, which return the partition into clusters, clustering and dispersion statistics, and the dendrogram for hierarchical clustering algorithms. MS Excel macros for the main clustering methods, with application examples, are given in the monograph [5]. One of the central problems of cluster analysis is to define criteria for the number of clusters, denoted K, into which a given set of objects is separated. There are several dozen approaches [4] to determining K. In particular, according to [6], K is the smallest number that satisfies a condition on the minimum total dispersion for a partition into K clusters and on the number of objects N. The number of clusters can also be obtained automatically through the successive extraction of anomalous clusters [4]. In 2010, a method for obtaining K by applying a density function was proposed and experimentally validated [4]. The article offers two simple approaches to determining K in which each cluster contains at least two objects. In the first, K is determined from the shortest Hamiltonian cycle; in the second, from the minimum spanning tree. Clustering examples with detailed step-by-step solutions and graphic illustrations are provided. The use of an Excel VBA macro that returns the minimum spanning tree is demonstrated on clustering problems. The article contains the macro code, with comments on the main unit.
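
The minimum-spanning-tree idea mentioned in this abstract can be illustrated with a short Python sketch. It is not the article's VBA macro, and the cut rule is an assumption: build the tree with Prim's algorithm on Euclidean distances, then drop the longest edges for as long as every resulting cluster keeps at least two objects. The names `minimum_spanning_tree`, `components`, and `mst_clusters` are hypothetical, and here K is passed in rather than derived.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [(math.inf, -1)] * n          # cheapest known link from the tree to each node
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        # pick the node outside the tree with the cheapest link
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] >= 0:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v][0]:
                    best[v] = (d, u)
    return edges

def components(n, edges):
    """Label connected components (union-find), numbered by first appearance."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, i, j in edges:
        parent[find(i)] = find(j)
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]

def mst_clusters(points, k):
    """Assumed rule: cut longest MST edges while every cluster keeps >= 2 objects."""
    n = len(points)
    kept = minimum_spanning_tree(points)
    for edge in sorted(kept, reverse=True):      # longest edge first
        if len(kept) == n - k:                   # n - k edges <=> k components
            break
        trial = [e for e in kept if e != edge]
        labels = components(n, trial)
        if min(labels.count(c) for c in set(labels)) >= 2:
            kept = trial
    return components(n, kept)
```

If no admissible cut remains, the function returns fewer than `k` clusters, which is one way to read the "at least two objects per cluster" constraint.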


2016 ◽  
Vol 2016 ◽  
pp. 1-8 ◽  
Author(s):  
Jinhua Li ◽  
Shiji Song ◽  
Yuli Zhang ◽  
Zhen Zhou

Incomplete data with missing feature values are prevalent in clustering problems. Traditional clustering methods first estimate the missing values by imputation and then apply the classical clustering algorithms for complete data, such as K-median and K-means. However, in practice, it is often hard to obtain accurate estimation of the missing values, which deteriorates the performance of clustering. To enhance the robustness of clustering algorithms, this paper represents the missing values by interval data and introduces the concept of robust cluster objective function. A minimax robust optimization (RO) formulation is presented to provide clustering results, which are insensitive to estimation errors. To solve the proposed RO problem, we propose robust K-median and K-means clustering algorithms with low time and space complexity. Comparisons and analysis of experimental results on both artificially generated and real-world incomplete data sets validate the robustness and effectiveness of the proposed algorithms.
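
The minimax idea can be made concrete with a small Python sketch. It covers only the inner worst-case step and a nearest-center assignment, under the assumptions of box-shaped intervals and squared Euclidean distance; it is not the paper's robust K-median or K-means algorithm, and the function names are hypothetical. For a box, the largest distance to a center is reached feature by feature at whichever interval endpoint lies farther from the center.

```python
def worst_case_sq_dist(lower, upper, center):
    """Largest squared Euclidean distance from `center` to any point in the box [lower, upper]."""
    return sum(max((c - lo) ** 2, (c - hi) ** 2)
               for lo, hi, c in zip(lower, upper, center))

def robust_assign(boxes, centers):
    """Assign each interval-valued sample to the center with the smallest worst-case distance."""
    labels = []
    for lower, upper in boxes:
        dists = [worst_case_sq_dist(lower, upper, c) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels
```

A fully observed feature is represented by an interval whose two endpoints coincide, so complete samples reduce to the usual squared distance.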


2021 ◽  
Vol 7 (4) ◽  
pp. 1-41
Author(s):  
Radu Mariescu-Istodor ◽  
Alexandru Cristian ◽  
Mihai Negrea ◽  
Peiwei Cao

The Vehicle Routing Problem (VRP) is an NP-hard problem in which we need to optimize itineraries for agents to visit multiple targets. When considering real-world travel (road-network topology, speed limits and traffic), modern VRP solvers can only process small instances with a few hundred targets. We propose a framework (VRPDiv) that can scale any solver to support larger VRP instances with up to ten thousand (10k) targets by dividing them into smaller clusters. VRPDiv supports multiple VRP scenarios and contains a pool of clustering algorithms from which it chooses the ideal one depending on properties of the instance. VRPDiv assigns agents based on cluster demand and target compatibility (i.e. realizable time-windows and capacity limitations). We incorporate the framework into the Bing Maps Multi-Itinerary Optimization (MIO) online service. This architecture allows MIO to scale up from solving instances with a few hundred targets to over 10k targets in under 10 minutes. We evaluate our framework on public datasets and publish a new dataset ourselves, as sufficiently large instances supporting real-world travel were impossible to find. We investigate multiple clustering methods and show that choosing the correct one is critical, with differences of up to 60% in quality. We compare with relevant baselines and report a 40% improvement in target allocation and a 9.8% improvement in itinerary durations. We compare with existing scores and report an average delta of 10%, with lower values (<5%) in instances with low workload (few targets per agent), which are acceptable for an online service.


2009 ◽  
Vol 2009 ◽  
pp. 1-16 ◽  
Author(s):  
David J. Miller ◽  
Carl A. Nelson ◽  
Molly Boeka Cannon ◽  
Kenneth P. Cannon

Fuzzy clustering algorithms are helpful when a dataset contains subgroupings of points with indistinct boundaries and overlap between the clusters. Traditional methods have been extensively studied and used on real-world data, but they require users to have some a priori knowledge of the outcome in order to determine how many clusters to look for. Additionally, iterative algorithms choose the optimal number of clusters based on one of several performance measures. In this study, the authors compare the performance of three algorithms (fuzzy c-means, Gustafson-Kessel, and an iterative version of Gustafson-Kessel) when clustering a traditional data set as well as real-world geophysics data collected from an archaeological site in Wyoming. Areas of interest in the data were identified using a crisp cutoff value as well as a fuzzy α-cut to determine which provided better elimination of noise and non-relevant points. Results indicate that the α-cut method eliminates more noise than the crisp cutoff values and that the iterative version of the fuzzy clustering algorithm is able to select an optimum number of subclusters within a point set (in both the traditional and real-world data), leading to proper indication of regions of interest for further expert analysis.
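
A brief Python sketch may help show what a fuzzy membership and an α-cut look like in practice. It uses the standard fuzzy c-means membership update for fixed centers; the way the study applies the α-cut to its geophysics data is not described here, so `alpha_cut` (keep points whose strongest membership reaches α) is an assumed, simplified reading, and both function names are hypothetical.

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """Standard fuzzy c-means membership update for fixed centers (fuzzifier m > 1)."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:                       # point coincides with one or more centers
            hits = dists.count(0.0)
            row = [1.0 / hits if d == 0.0 else 0.0 for d in dists]
        else:
            row = [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]
        memberships.append(row)
    return memberships

def alpha_cut(memberships, alpha):
    """Indices of points whose strongest membership is at least alpha (assumed rule)."""
    return [i for i, row in enumerate(memberships) if max(row) >= alpha]
```

Each membership row sums to one, so a point midway between two centers gets 0.5 for each and is dropped by any α above 0.5, which is how such a cut can filter ambiguous or noisy points.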


2014 ◽  
Vol 2014 ◽  
pp. 1-8 ◽  
Author(s):  
Kang Zhang ◽  
Xingsheng Gu

Clustering has been widely used in different fields of science, technology, social science, and so forth. In the real world, numeric as well as categorical features are usually used to describe data objects. Accordingly, many clustering methods can process datasets that are either numeric or categorical. Recently, algorithms that can handle mixed data clustering problems have been developed. The affinity propagation (AP) algorithm is an exemplar-based clustering method that has demonstrated good performance on a wide variety of datasets. However, it has limitations when processing mixed datasets. In this paper, we propose a novel similarity measure for mixed-type datasets and an adaptive AP clustering algorithm to cluster them. Several real-world datasets are studied to evaluate the performance of the proposed algorithm. Comparisons with other clustering algorithms demonstrate that the proposed method works well not only on mixed datasets but also on pure numeric and categorical datasets.
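
The paper's own similarity measure is not given in this abstract. As a stand-in, the Python sketch below uses a Gower-style similarity, a well-known way to combine numeric and categorical features, to show the kind of pairwise score that a similarity-based method such as AP takes as input. The function name and the `ranges` convention are assumptions made for illustration.

```python
def gower_similarity(a, b, ranges):
    """Gower-style similarity of two records.

    `ranges[i]` is the numeric range (max - min) of feature i,
    or None when feature i is categorical (hypothetical convention).
    """
    scores = []
    for x, y, r in zip(a, b, ranges):
        if r is None:                 # categorical: simple matching
            scores.append(1.0 if x == y else 0.0)
        elif r == 0:                  # constant numeric feature: no contrast
            scores.append(1.0)
        else:                         # numeric: range-normalized closeness
            scores.append(1.0 - abs(x - y) / r)
    return sum(scores) / len(scores)
```

For example, with an age range of 20 years, two records aged 30 and 40 that share the same colour category score (0.5 + 1.0) / 2 = 0.75.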


Author(s):  
Yi-Hui Chen ◽  
Eric Jui-Lin Lu ◽  
Ya-Wen Cheng

Most clustering algorithms build disjoint clusters. However, clusters may overlap because documents may belong to two or more categories in the real world. For example, a paper discussing the Apple Watch may be categorized under 3C, Fashion, or even Clothing and Shoes. Therefore, overlapping clustering algorithms have been studied so that a resource can be assigned to one or more clusters. Formal Concept Analysis (FCA), which has many practical applications in information science, has been used in disjoint clustering but has not been studied in overlapping clustering. To make overlapping clustering possible with FCA, we propose an approach that includes two types of transformation. The experimental results show that the proposed fuzzy overlapping clustering performed more efficiently than existing overlapping clustering methods. The positive results confirm the feasibility of the proposed scheme for overlapping clustering. It can also be used in applications such as recommendation systems.


Symmetry ◽  
2022 ◽  
Vol 14 (1) ◽  
pp. 60
Author(s):  
Kun Gao ◽  
Hassan Ali Khan ◽  
Wenwen Qu

Density clustering has been widely used in many research disciplines to determine the structure of real-world datasets. Existing density clustering algorithms only work well on complete datasets. In real-world datasets, however, there may be missing feature values due to technical limitations. Many imputation methods used for density clustering cause the aggregation phenomenon. To solve this problem, a two-stage novel density peak clustering approach with missing features is proposed: First, the density peak clustering algorithm is used for the data with complete features, while the labeled core points that can represent the whole data distribution are used to train the classifier. Second, we calculate a symmetrical FWPD distance matrix for incomplete data points, then the incomplete data are imputed by the symmetrical FWPD distance matrix and classified by the classifier. The experimental results show that the proposed approach performs well on both synthetic datasets and real datasets.


2012 ◽  
Vol 21 (04) ◽  
pp. 1240018 ◽  
Author(s):  
NICOLAS TSAPATSOULIS ◽  
OLGA GEORGIOU

The continuous increase in demand for new products and services on the market has created the need for systematic improvement of recommendation technologies. Recommender systems have proved to be the answer to the data overload problem and an advantage for e-business. Nevertheless, challenges that recommender systems face, such as sparsity and scalability, affect their performance in real-world situations where the numbers of users and items are both high and item rating is infrequent. In this article we propose a cluster-based recommendation approach using genetic algorithms. Users are grouped into clusters based on their past choices and preferences and receive recommendations from the other cluster members with the aid of an innovative recommendation scheme called Top-N voted items. Similarity between users is computed using the max_norm Pearson coefficient, a modified form of the widely used Pearson coefficient that prevents very active users from dominating recommendations. We compare our approach with five well-established recommendation methods on three different datasets. These datasets vary in terms of the number of users, the number of items, and the sparsity of ratings. As a result, important conclusions are drawn about the efficiency of each method with respect to scalability and dataset sparsity.


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Anna Magdalena Korzeniowska

Abstract: Social expenditure plays an important role in European Union (EU) countries. It improves the lives of citizens whose welfare is endangered due to poverty or illness. However, social expenditure represents a considerable share of the budgets of EU member states. Despite evident similarities in their levels of development, EU countries show apparent differences in social expenditure levels. Therefore, this work aims to determine the similarities and differences between EU countries in this regard. The analysis uses clustering methods, such as hierarchical cluster analysis and the k-means, to divide countries into homogeneous groups. The research demonstrates significant differences between EU countries in the years 2008–2018, which resulted in a low number of objects (countries) in the identified groups. In the case of 6 out of 28 countries, it was not possible to assign them to any group. The research proves that EU countries should take more care when organising their social policy, taking into consideration cultural and social factors.


2021 ◽  
Vol 12 ◽  
Author(s):  
Yuan Zhao ◽  
Zhao-Yu Fang ◽  
Cui-Xiang Lin ◽  
Chao Deng ◽  
Yun-Pei Xu ◽  
...  

In recent years, the application of single-cell RNA-seq (scRNA-seq) has become increasingly popular in fields such as biology and medical research. Analyzing scRNA-seq data can reveal complex cell populations and infer single-cell trajectories in cell development. Clustering is one of the most important methods for analyzing scRNA-seq data. In this paper, we focus on improving scRNA-seq clustering through gene selection, which also reduces the dimensionality of scRNA-seq data. Studies have shown that gene selection for scRNA-seq data can improve clustering accuracy. Therefore, it is important to select genes with cell type specificity. Gene selection not only helps to reduce the dimensionality of scRNA-seq data, but can also improve cell type identification in combination with clustering methods. Here, we propose RFCell, a supervised gene selection method based on permutation and random forest classification. We first use RFCell and three existing gene selection methods to select gene sets on 10 scRNA-seq data sets. Then, three classical clustering algorithms are used to cluster the cells using the genes chosen by each selection method. We found that the gene selection performance of RFCell was better than that of the other gene selection methods.

