Optimisation and parallelisation of the partitioning around medoids function in R

Author(s):  
Michal Piotrowski ◽  
Thorsten Forster ◽  
Bartosz Dobrezelecki ◽  
Terence M. Sloan ◽  
Lawrence Mitchell ◽  
...  
Author(s):  
Hyeuk Kim

Unsupervised learning divides data into several groups such that observations in the same group share similar characteristics, while observations in different groups have different characteristics. In this paper, we cluster data by partitioning around medoids, which has some advantages over k-means clustering. We apply it to baseball players in the Korea Baseball League. We also apply principal component analysis to the data and plot the result using two components as axes. Through this procedure, we interpret the meaning of the clusters graphically. The combination of partitioning around medoids and principal component analysis can be applied to any other data, and the approach makes the characteristics of each cluster easy to identify.
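To make the idea of partitioning around medoids concrete, here is a minimal sketch in Python. It is not the authors' implementation: it assumes Euclidean distance, a naive start (the first k points) instead of PAM's BUILD phase, and a greedy first-improvement swap loop. The function names `total_cost` and `pam` are illustrative.

```python
import math
from itertools import product


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    """Simplified PAM: naive start (assumption), then greedy medoid swaps."""
    medoids = list(range(k))          # assumption: first k points as start
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        # Try replacing each medoid with each non-medoid point.
        for i, cand in product(range(k), range(len(points))):
            if cand in medoids:
                continue
            trial = medoids[:i] + [cand] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best - 1e-12:   # accept only strict improvements
                medoids, best, improved = trial, cost, True
    # Assign every point to the cluster of its nearest medoid.
    labels = [
        min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels, best


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels, cost = pam(points, k=2)
```

Unlike k-means centroids, the medoids returned here are indices of actual observations, which is what makes each cluster representable by a real data point.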


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Gregoire Preud’homme ◽  
Kevin Duarte ◽  
Kevin Dalleau ◽  
Claire Lacomblez ◽  
Emmanuel Bresso ◽  
...  

Abstract The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables using 7 scenarios with varying population sizes, number of clusters, number of continuous and categorical variables, proportions of relevant (non-noisy) variables and degree of variable relevance (low, mild, high). Clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI compared to model-based methods in all scenarios. When applying clustering methods to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
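For readers unfamiliar with the Adjusted Rand Index used as the benchmark metric above, the following Python sketch computes it from two label vectors using the standard pair-counting formula. It is an illustration, not the study's code; the function name `adjusted_rand_index` is hypothetical, and the sketch assumes at least two observations.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI; assumes len(labels_true) >= 2."""
    n = len(labels_true)
    # Pairs of observations placed together in both partitions.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together in each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate case: identical trivial partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # same partition, renamed labels
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # maximally crossed partition
```

An ARI of 1 means the two partitions agree exactly up to label names, values near 0 indicate chance-level agreement, and negative values indicate agreement worse than chance.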


2021 ◽  
Vol 27 (12) ◽  
pp. 2679-2697
Author(s):  
Lyudmila E. ROMANOVA ◽  
Anna L. SABININA ◽  
Andrei I. CHUKANOV ◽  
Dar’ya M. KORSHUNOVA

Subject. This article deals with the particularities of the development of housing mortgage lending in the regions of Russia. Objectives. The article aims to substantiate the need for clustering of territorial entities by level of development of mortgage housing lending in Russia and test the most effective algorithm for mortgage clustering of regions. Methods. For the study, we used a systems approach, including scientific abstraction, analysis and synthesis, and statistical methods of data analysis. The algorithm k-medoids – Partitioning Around Medoids (PAM) was also used. Results. Based on the results of the study of regional statistics of the Russian Federation, the article reveals a significant asymmetry in the values of key socioeconomic indices that determine the level and dynamics of housing mortgages in the regions. This necessitates the clustering of territorial entities according to the level of development of mortgage housing lending in the country. To take into account the impact of various local conditions in assessing the prospects for the development of regional housing mortgages, the article proposes an indicator, namely, the integral regional mortgage affordability index. On its basis, in accordance with the selected clustering procedure, the article identifies five mortgage clusters in Russia and identifies their representative regions. Conclusions. Based on the analysis of the specificity of the development of regional mortgages in the Tula Oblast, taking into account the implementation of the target State programme, the article concludes that it is necessary to improve the mechanisms for financing regional mortgage programmes and justifies the need to develop differentiated programmes for the development of housing mortgages in groups of Russian regions.


Author(s):  
Alexis Caro ◽  
Fernando Gimeno ◽  
Antoine Rabatel ◽  
Thomas Condom ◽  
Jean Carlos Ruiz

Using topographic and climatic variables, we present glacier clusters in the Chilean Andes (17.6-55.4°S), obtained by running the unsupervised machine-learning algorithm Partitioning Around Medoids (PAM). The results classified 23,974 glaciers into thirteen clusters, which show specific conditions in terms of annual and monthly amounts of precipitation, temperature and solar radiation. In the Dry Andes, the mean annual values of five glacier clusters (C1-C5) show differences in precipitation and temperature of up to 400 mm (29 and 33°S) and 8°C (33°S), with a mean elevation difference of 1800 m between glaciers in clusters C1 and C5 (18 to 34°S). In the Wet Andes, the largest differences were observed at the latitude of the Southern Patagonian Icefield (50°S), where mean annual precipitation and temperature values show maritime precipitation above 3700 mm/yr (C12), where humid westerly air plays an important role, and below 1000 mm/yr to the east of the Southern Patagonian Icefield (C10), with temperature differences close to 4°C and a mean elevation difference of 500 m. This classification confirms that Chilean glaciers cannot be grouped by latitude alone, and it contributes to a better understanding of recent glacier volume changes at the regional scale.


2021 ◽  
Author(s):  
Felipe M. Moreno ◽  
Eduardo A. Tannuri

Abstract The methodology described in this paper is used to reduce a large set of combined wind, wave, and current conditions to a smaller set that still represents the desired site well enough for ship maneuvering simulations. This is achieved by running fast-time simulations for the entire set of environmental conditions and recording the vessel’s drifting time-series while it is controlled by an autopilot based on a line-of-sight algorithm. The cases are then grouped according to how similar the vessel’s drifting time-series are, and one environmental condition is selected to represent each group found by the cluster analysis. The dissimilarity between time-series is measured with Dynamic Time Warping, and the cluster analysis combines the Partitioning Around Medoids algorithm with the Silhouette Method. Validation is performed through maneuvering simulations with a Second Deck Officer.
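The Dynamic Time Warping dissimilarity mentioned above can be sketched with the classic dynamic-programming recurrence. The Python below is a minimal illustration under stated assumptions (one-dimensional series, absolute difference as the local cost, no warping-window constraint); it is not the authors' code, and `dtw_distance` is a hypothetical name.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])   # assumption: absolute difference
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])   # same shape, different timing
offset = dtw_distance([0, 0, 0], [1, 1, 1])         # constant offset of 1
```

Because DTW yields a pairwise dissimilarity rather than coordinates, it pairs naturally with a medoid-based method, which needs only a dissimilarity matrix and never has to average time-series.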


2006 ◽  
Vol 5 (6) ◽  
pp. 1102-1105 ◽  
Author(s):  
D.K. Swami ◽  
R.C. Jain

2014 ◽  
Vol 14 (1) ◽  
pp. 7-21
Author(s):  
Jacek Szanduła

Abstract The paper develops the concept of harnessing data classification methods to recognize patterns in stock prices. The author defines a formation as a pattern vector describing the financial instrument. Elements of such a vector can be related to the stock price as well as sales volume and other characteristics of the financial instrument. The study uses data concerning selected companies listed on the stock exchange in New York. It takes into account a number of variables that describe the behavior of prices and volume, both in the short and long term. The partitioning around medoids method was used for data classification (for pattern recognition). The paper also evaluates the possibility of using certain formations for practical purposes.


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. e24150-e24150
Author(s):  
Francesco Pantano ◽  
Paolo Manca ◽  
Grazia Armento ◽  
Tea Zeppola ◽  
Angelo Onorato ◽  
...  

e24150 Background: A large proportion of patients with cancer suffer from breakthrough cancer pain (BTcP). Several unmet clinical needs concerning BTcP treatment, such as optimal opioid dosage, are being investigated. We used an unsupervised learning algorithm to explore whether distinct subtypes of BTcP exist and whether they can provide new insights into clinical practice. Methods: We applied the partitioning around medoids algorithm to a large dataset of patients with BTcP previously collected by the IOPS group in order to identify possible subgroups of BTcP; the input of the algorithm consisted of different BTcP features, such as its duration or its intensity. Silhouette statistics were used to pick an optimal number of clusters. The resulting clusters were analyzed in terms of BTcP therapy satisfaction, clinical features and usage of basal pain and rapid-onset opioids. Opioid dosages were converted to a single scale, and the BTcP-opioids-to-basal-pain-opioids ratio (OpR) was calculated for each patient. Polynomial logistic regression was used to capture non-linear relationships between therapy satisfaction and opioid usage. Results: The cohort comprised 4016 patients with controlled basal pain and suffering from BTcP. Our algorithm identified 12 distinct BTcP clusters. Optimal OpRs differed across the clusters, ranging from 15% to 50%. In the whole cohort, OpR was more clearly associated with therapy satisfaction than BTcP opioids or basal pain opioids alone. The majority of the clusters were linked to a peculiar association of certain drugs with therapy satisfaction or dissatisfaction. A free online tool was created to compute clusters for new patients ( https://mancapaolo.shinyapps.io/UCBM_BTcPclusters/ ) in order to validate these clusters in future studies and to provide possible, handy indications for personalized BTcP therapy.
Conclusions: This work proposes a classification for BTcP and identifies subgroups of patients with unique efficacy of different pain medications. This work supports the theory that the optimal dose of BTcP opioids depends on the dose of basal opioids and identifies novel values, possibly useful for future trials.
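The abstract above selects the number of clusters with silhouette statistics. As a rough illustration of that criterion (not the study's code), the Python sketch below computes the mean silhouette width for a given labelling. It assumes Euclidean distance, at least two clusters, and points that are not all identical; `mean_silhouette` is a hypothetical name.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette width; assumes at least two distinct labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)   # singleton cluster: silhouette is 0 by convention
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = mean_silhouette(points, [0, 0, 1, 1])   # labels follow the two visible groups
bad = mean_silhouette(points, [0, 1, 0, 1])    # labels cut across the groups
```

In practice one would run the clustering for several candidate values of k and keep the k whose labelling gives the highest mean silhouette width.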

