Optimisation and parallelisation of the partitioning around medoids function in R

Unsupervised learning in machine learning divides data into several groups. Observations in the same group have similar characteristics, and observations in different groups have different characteristics. In this paper, we cluster data by partitioning around medoids, which has some advantages over k-means clustering. We apply it to baseball players in the Korea Baseball League. We also apply principal component analysis to the data and plot the result using two components as the axes. We interpret the meaning of the clustering graphically through this procedure. The combination of partitioning around medoids and principal component analysis can be applied to any other data, and the approach makes it easy to identify the characteristics of each group.
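
As a rough illustration of what partitioning around medoids does, the sketch below runs a greedy BUILD step followed by a SWAP step on a precomputed distance matrix. It is a minimal pure-Python sketch, not the R implementation used in the paper; the function names `total_cost` and `pam` and the toy one-dimensional data are assumptions made for this example.

```python
from itertools import product


def total_cost(dist, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)


def pam(dist, k):
    """Greedy BUILD then SWAP on a precomputed distance matrix (illustrative)."""
    n = len(dist)
    # BUILD: add, one at a time, the point that lowers the total cost most.
    medoids = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda c: total_cost(dist, medoids + [c]))
        medoids.append(best)
    # SWAP: exchange a medoid with a non-medoid while that lowers the cost.
    improved = True
    while improved:
        improved = False
        current = total_cost(dist, medoids)
        for pos, cand in product(range(k), range(n)):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            cost = total_cost(dist, trial)
            if cost < current:
                medoids, current, improved = trial, cost, True
    # Assign every point to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda j: row[medoids[j]]) for row in dist]
    return medoids, labels


# Toy data (hypothetical): two well-separated groups on a line.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
medoids, labels = pam(dist, k=2)
print(medoids, labels)
```

Because the medoids are actual observations and only pairwise distances are needed, the same routine works with any dissimilarity measure, which is one of the usual arguments for preferring it to k-means.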

Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark

Scientific Reports ◽

10.1038/s41598-021-83340-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Gregoire Preud’homme ◽

Kevin Duarte ◽

Kevin Dalleau ◽

Claire Lacomblez ◽

Emmanuel Bresso ◽

...

Keyword(s):

Hierarchical Clustering ◽

Latent Class ◽

Latent Class Model ◽

Real Life ◽

Heterogeneous Data ◽

Mixed Data ◽

Categorical Variables ◽

Clustering Methods ◽

Model Based ◽

Partitioning Around Medoids

Abstract: The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. data with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables using 7 scenarios with varying population sizes, number of clusters, number of continuous and categorical variables, proportions of relevant (non-noisy) variables and degree of variable relevance (low, mild, high). Clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI compared to model-based methods in all scenarios. When applying clustering methods to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
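
The Adjusted Rand Index used as the performance measure above can be computed directly from the contingency counts of two labelings. The following is a minimal pure-Python sketch of the standard pair-counting formula, not the benchmark's own R code; the function name `adjusted_rand_index` and the toy labelings are assumptions for illustration, and the sketch assumes at least two observations.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same observations (needs n >= 2)."""
    n = len(labels_a)
    # Pairs of observations placed together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    maximum = (sum_a + sum_b) / 2           # largest attainable agreement
    if maximum == expected:                 # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
same_partition = adjusted_rand_index(truth, [1, 1, 0, 0])   # labels renamed only
crossed_partition = adjusted_rand_index(truth, [0, 1, 0, 1])
print(same_partition, crossed_partition)
```

The index depends only on which observations are grouped together, so renaming the clusters leaves it unchanged, while a partition that cuts across the reference one scores below zero.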

Revealing of the specificity of mortgage housing lending based on the results of clustering of regions

Finance and Credit ◽

10.24891/fc.27.12.2679 ◽

2021 ◽

Vol 27 (12) ◽

pp. 2679-2697

Author(s):

Lyudmila E. ROMANOVA ◽

Anna L. SABININA ◽

Andrei I. CHUKANOV ◽

Dar’ya M. KORSHUNOVA

Keyword(s):

Systems Approach ◽

Mortgage Lending ◽

Effective Algorithm ◽

Russian Regions ◽

Local Conditions ◽

Partitioning Around Medoids ◽

Analysis And Synthesis ◽

The Russian Federation ◽

The Impact ◽

Regions Of Russia

Subject. This article deals with the particularities of the development of housing mortgage lending in the regions of Russia. Objectives. The article aims to substantiate the need for clustering of territorial entities by level of development of mortgage housing lending in Russia and to test the most effective algorithm for mortgage clustering of regions. Methods. For the study, we used a systems approach, including scientific abstraction, analysis and synthesis, and statistical methods of data analysis. The k-medoids algorithm, Partitioning Around Medoids (PAM), was also used. Results. Based on the results of the study of regional statistics of the Russian Federation, the article reveals a significant asymmetry in the values of key socioeconomic indices that determine the level and dynamics of housing mortgages in the regions. This necessitates the clustering of territorial entities according to the level of development of mortgage housing lending in the country. To take into account the impact of various local conditions in assessing the prospects for the development of regional housing mortgages, the article proposes an indicator, namely, the integral regional mortgage affordability index. On its basis, in accordance with the selected clustering procedure, the article identifies five mortgage clusters in Russia and their representative regions. Conclusions. Based on the analysis of the specificity of the development of regional mortgages in the Tula Oblast, taking into account the implementation of the target State programme, the article concludes that it is necessary to improve the mechanisms for financing regional mortgage programmes and justifies the need to develop differentiated programmes for the development of housing mortgages in groups of Russian regions.

Identificación de clústeres glaciares a lo largo de los Andes chilenos usando variables topoclimáticas

Investigaciones Geográficas ◽

10.5354/0719-5370.2020.59009 ◽

2020 ◽

pp. 119

Author(s):

Alexis Caro ◽

Fernando Gimeno ◽

Antoine Rabatel ◽

Thomas Condom ◽

Jean Carlos Ruiz

Keyword(s):

Partitioning Around Medoids

Using topographic and climatic variables, we present glacier clusters in the Chilean Andes (17.6-55.4°S), obtained by running the unsupervised machine-learning algorithm Partitioning Around Medoids (PAM). The results classified 23,974 glaciers into thirteen clusters, which show specific conditions in terms of annual and monthly amounts of precipitation, temperature and solar radiation. In the Dry Andes, the mean annual values of five glacier clusters (C1-C5) show differences in precipitation and temperature of up to 400 mm (29 and 33°S) and 8°C (33°S), with a mean elevation difference of 1800 m between the glaciers of clusters C1 and C5 (18 to 34°S). In the Wet Andes, the largest differences were observed at the latitude of the Southern Patagonian Icefield (50°S), where the mean annual values show maritime precipitation above 3700 mm/year (C12), where moist westerly air plays an important role, and below 1000 mm/year to the east of the Southern Patagonian Icefield (C10), with temperature differences close to 4°C and a mean elevation difference of 500 m. This classification confirms that Chilean glaciers cannot be grouped by latitude alone, and it contributes to a better understanding of recent changes in glacier volume at the regional scale.

Clustering Applied to Large Sets of Environmental Conditions for Selecting Typical Scenarios for Ship Maneuvering Real-Time Simulations

10.1115/omae2021-62875 ◽

2021 ◽

Author(s):

Felipe M. Moreno ◽

Eduardo A. Tannuri

Keyword(s):

Time Series ◽

Cluster Analysis ◽

Environmental Conditions ◽

Wind Waves ◽

Fast Time ◽

Large Set ◽

Ship Maneuvering ◽

Partitioning Around Medoids ◽

Waves And Currents ◽

Maneuvering Simulations

Abstract: The methodology described in this paper is used to reduce a large set of combined wind, wave, and current conditions to a smaller set that still represents the desired site well enough for ship maneuvering simulations. This is achieved by running fast-time simulations for the entire set of environmental conditions and recording the vessel’s drifting time series while it is controlled by an autopilot based on a line-of-sight algorithm. The cases are then grouped according to how similar the vessel’s drifting time series are, and one environmental condition is selected to represent each group found by the cluster analysis. The dissimilarity between time series is measured with Dynamic Time Warping, and the cluster analysis combines the Partitioning Around Medoids algorithm with the Silhouette Method. Validation is carried out through maneuvering simulations performed with a Second Deck Officer.
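
To make the dissimilarity step concrete, the sketch below computes a classic Dynamic Time Warping distance between two numeric series and builds the pairwise matrix that a medoid-based clustering could consume. It is a minimal pure-Python sketch under stated assumptions: the absolute difference as local cost, no warping window, and made-up series; the names `dtw_distance` and `pairwise_dtw` are hypothetical and do not come from the paper.

```python
def dtw_distance(a, b):
    """Classic DTW with absolute-difference local cost and no window constraint."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def pairwise_dtw(series):
    """Symmetric dissimilarity matrix between all series."""
    return [[dtw_distance(s, t) for t in series] for s in series]


# Made-up drift-like series: the first two differ only in timing.
series = [[0, 0, 1, 2], [0, 1, 2, 2], [5, 5, 6, 7]]
matrix = pairwise_dtw(series)
print(matrix)
```

Series that follow the same shape with a time shift end up at distance zero here, which is why a warping-based measure suits trajectories that differ mainly in timing.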

PAMC: Partitioning Around Medoids for Classification

Information Technology Journal ◽

10.3923/itj.2006.1102.1105 ◽

2006 ◽

Vol 5 (6) ◽

pp. 1102-1105 ◽

Cited By ~ 1

Author(s):

D.K. Swami ◽

R.C. Jain

Keyword(s):

Partitioning Around Medoids

Clustering by partitioning around medoids using distance-based similarity measures on interval-scaled variables

Nigerian Journal of Technological Development ◽

10.4314/njtd.v15i1.1 ◽

2018 ◽

Vol 15 (1) ◽

pp. 1

Author(s):

D.L. Nkweteyim

Keyword(s):

Similarity Measures ◽

Partitioning Around Medoids

Forecasting Changes in Stock Prices on the Basis of Patterns Identified with the Use of Data Classification Methods

Folia Oeconomica Stetinensia ◽

10.2478/foli-2014-0101 ◽

2014 ◽

Vol 14 (1) ◽

pp. 7-21

Author(s):

Jacek Szanduła

Keyword(s):

New York ◽

Stock Prices ◽

Stock Price ◽

Stock Exchange ◽

Data Classification ◽

Classification Methods ◽

Financial Instrument ◽

Partitioning Around Medoids ◽

Use Of Data ◽

Short And Long Term

Abstract: The paper develops the concept of harnessing data classification methods to recognize patterns in stock prices. The author defines a formation as a pattern vector describing the financial instrument. Elements of such a vector can be related to the stock price as well as to sales volume and other characteristics of the financial instrument. The study uses data on selected companies listed on the stock exchange in New York. It takes into account a number of variables that describe the behavior of prices and volume, both in the short and long term. The partitioning around medoids method was used for data classification (for pattern recognition). An evaluation of the possibility of using certain formations for practical purposes is also presented.

Implementation of hybrid clustering based on partitioning around medoids algorithm and divisive analysis on human Papillomavirus DNA

10.1063/1.4978974 ◽

2017 ◽

Cited By ~ 1

Author(s):

Mentari Dian Arimbi ◽

Alhadi Bustamam ◽

Dian Lestari

Keyword(s):

Human Papillomavirus ◽

Hybrid Clustering ◽

Partitioning Around Medoids ◽

Papillomavirus Dna

Breakthrough cancer pain clinical features and differential opioids response: A machine learning approach in 4,016 cancer patients of IOPS-MS study.

Journal of Clinical Oncology ◽

10.1200/jco.2020.38.15_suppl.e24150 ◽

2020 ◽

Vol 38 (15_suppl) ◽

pp. e24150-e24150

Author(s):

Francesco Pantano ◽

Paolo Manca ◽

Grazia Armento ◽

Tea Zeppola ◽

Angelo Onorato ◽

...

Keyword(s):

Cancer Pain ◽

Clinical Features ◽

Learning Algorithm ◽

Optimal Dose ◽

Rapid Onset ◽

Optimal Number ◽

Breakthrough Cancer Pain ◽

Partitioning Around Medoids ◽

Linear Relationships ◽

Practice Methods

Background: A large proportion of patients with cancer suffer from breakthrough cancer pain (BTcP). Several unmet clinical needs concerning BTcP treatment, such as optimal opioid dosage, are being investigated. We explored with an unsupervised learning algorithm whether distinct subtypes of BTcP exist and whether they can provide new insights into clinical practice. Methods: We used the partitioning around medoids algorithm on a large dataset of patients with BTcP previously collected by the IOPS group in order to identify possible subgroups of BTcP; the input of the algorithm consisted of different BTcP features, such as its duration or its intensity. The silhouette statistic was used to pick an optimal number of clusters. The resulting clusters were analyzed in terms of BTcP therapy satisfaction, clinical features and usage of basal pain and rapid onset opioids. Opioid dosages were converted to a unique scale and the BTcP-opioids-to-basal-pain-opioids ratio (OpR) was calculated for each patient. Polynomial logistic regression was used to capture non-linear relationships between therapy satisfaction and opioid usage. Results: The cohort comprised 4016 patients with controlled basal pain and suffering from BTcP. Our algorithm identified 12 distinct BTcP clusters. Optimal OpRs differed across the clusters, ranging from 15% to 50%. In the whole cohort, OpR was more clearly associated with therapy satisfaction than BTcP opioids or basal pain opioids alone. The majority of the clusters were linked to a peculiar association of certain drugs with therapy satisfaction or dissatisfaction. A free online tool was created for new patients cluster computation ( https://mancapaolo.shinyapps.io/UCBM_BTcPclusters/ ) in order to validate these clusters in future studies and to provide a possible, handy indication for personalized BTcP therapy.
Conclusions: This work proposes a classification for BTcP and identifies subgroups of patients with unique efficacy of different pain medications. This work supports the theory that the optimal dose of BTcP opioids depends on the dose of basal opioids and identifies novel values, possibly useful for future trials.
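
The idea of picking the number of clusters with the silhouette statistic can be sketched as follows: compute the mean silhouette width for each candidate partition and keep the one with the highest value. This is a minimal pure-Python sketch, not the study's code; the function name `mean_silhouette`, the toy points and the candidate labelings are assumptions for illustration, and the sketch assumes at least two clusters and not all points identical.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette width from a distance matrix and cluster labels."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        total += (b - a) / max(a, b)
    return total / len(labels)


# Hypothetical candidate partitions of four points, keyed by number of clusters.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
scores = {k: mean_silhouette(dist, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In practice the candidate labelings would come from running the clustering algorithm once per value of k; the hand-written labelings above only stand in for that step.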
