Unsupervised machine learning and pandemics spread: the case of COVID-19

Epidemics have severe impacts on people's health. COVID-19 has infected more than 3 million people in 3 months. In this work, we explore the use of unsupervised machine learning to evaluate and monitor the worldwide spread of the disease at three points in time: January, February, and March of 2020. Besides the features related to the disease spread, we consider the Human Development Index (HDI), population density, and age structure. We define the number of clusters using the elbow and agglomerative clustering methods, then implement and evaluate the k-means algorithm with 3, 4, and 5 clusters. We conclude that four clusters best represent the data, analyze the clusters over time, and discuss the impacts on each depending on the measures adopted.
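The elbow method mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic 2-D points in four tight groups (not the country-level features used in the study), and the farthest-point initialisation is simply a deterministic choice that keeps the example reproducible.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def farthest_point_init(points, k):
    """Deterministic initialisation (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's k-means; returns (centroids, inertia)."""
    centroids = farthest_point_init(points, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean(g) if g else centroids[j] for j, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia

# Synthetic data: four tight groups at the corners of a square.
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5), (0.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

# Within-cluster sum of squares for each candidate k; the "elbow" is the
# k after which adding clusters stops paying off.
inertias = {k: kmeans(points, k)[1] for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 3))
```

On this toy data the inertia drops sharply up to k = 4 and barely changes afterwards, which is the pattern the elbow criterion looks for. Real data rarely show such a clean bend, which may be why the authors also compare 3, 4, and 5 clusters directly.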


Machine Learning Applications and Optimization of Clustering Methods Improve the Selection of Descriptors in Blackberry Germplasm Banks

Plants ◽

10.3390/plants10020247 ◽

2021 ◽

Vol 10 (2) ◽

pp. 247

Author(s):

Juan Camilo Henao-Rojas ◽

María Gladis Rosero-Alpala ◽

Carolina Ortiz-Muñoz ◽

Carlos Enrique Velásquez-Arroyo ◽

William Alfonso Leon-Rueda ◽

...

Keyword(s):

Machine Learning ◽

Support Vector ◽

P Value ◽

Clustering Methods ◽

Agglomerative Clustering ◽

Discriminating Power ◽

Hierarchical Agglomerative Clustering ◽

Machine Learning Applications ◽

Germplasm Banks ◽

Selection Of

Machine learning (ML) and its multiple applications have comparative advantages for improving the interpretation of knowledge on different agricultural processes. However, there are challenges that impede proper usage, as can be seen in phenotypic characterizations of germplasm banks. The objective of this research was to test and optimize different ML-based analysis methods for the prioritization and selection of morphological descriptors of Rubus spp. Fifty-five descriptors were evaluated in 26 genotypes, and the weight and discriminating capacity of each were determined. ML methods such as random forest (RF), support vector machines (in the linear and radial forms), and neural networks were optimized and compared. Subsequently, the results were validated with two discriminating methods and their variants: hierarchical agglomerative clustering and K-means. The results indicated that RF presented the highest accuracy (0.768) of the methods evaluated, selecting 11 descriptors based on the purity (Gini index), importance, number of connected trees, and significance (p value < 0.05). Additionally, the K-means method with optimized descriptors based on RF had greater discriminating power on Rubus spp. accessions according to the evaluated statistics. This study presents one application of ML for the optimization of specific morphological variables for plant germplasm bank characterization.
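To make the purity (Gini index) criterion concrete, the sketch below computes the Gini impurity of a set of class labels and the impurity decrease obtained by splitting on a single descriptor. The descriptor values, class labels, and threshold are hypothetical and chosen only for illustration. A random forest aggregates decreases of this kind over many trees and splits, so this shows the building block rather than the authors' full procedure.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(values, labels, threshold):
    """Impurity of the parent node minus the size-weighted impurity of
    the two children produced by the split `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Hypothetical data: six accessions from two groups, two descriptors.
groups = ["A", "A", "A", "B", "B", "B"]
descriptor_1 = [1, 2, 3, 7, 8, 9]   # separates the groups cleanly
descriptor_2 = [1, 7, 2, 8, 3, 9]   # mixes the groups

decrease_1 = impurity_decrease(descriptor_1, groups, threshold=5)
decrease_2 = impurity_decrease(descriptor_2, groups, threshold=5)
print(decrease_1, decrease_2)
```

A descriptor that yields a larger impurity decrease discriminates better between groups, which is the intuition behind ranking descriptors by Gini-based importance.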


134. Derivation of novel phenotypes of outpatient pediatrician prescribing patterns

Open Forum Infectious Diseases ◽

10.1093/ofid/ofaa439.179 ◽

2020 ◽

Vol 7 (Supplement_1) ◽

pp. S78-S79

Author(s):

Joshua C Herigon ◽

Jonathan Hatoun ◽

Louis Vernacchio

Keyword(s):

Machine Learning ◽

Antibiotic Prescription ◽

Optimal Number ◽

Prescribing Patterns ◽

Antibiotic Prescribing ◽

Prescribing Practices ◽

Number Of Clusters ◽

Unsupervised Machine Learning ◽

Individual Clinician ◽

Optimal Number Of Clusters

Abstract

Background: Antibiotics are the most commonly prescribed drugs for children, with estimates that 30%-50% of outpatient antibiotic prescriptions are inappropriate. Most analyses of outpatient antibiotic prescribing practices do not examine patterns within individual clinicians' prescribing practices. We sought to derive unique phenotypes of outpatient antibiotic prescribing practices using an unsupervised machine learning clustering algorithm.

Methods: We extracted diagnoses and prescribing data on all problem-focused visits with a physician or nurse practitioner between 6/11/2018 – 12/11/2018 for a state-wide association of pediatric practices across Massachusetts. Clinicians with fewer than 100 encounters were excluded. The proportion of encounters resulting in an antibiotic prescription was calculated. Proportions were stratified by diagnosis: otitis media (OM), pharyngitis, pneumonia (PNA), sinusitis, skin & soft tissue infection (SSTI), and urinary tract infection (UTI). We then applied consensus k-means clustering, a form of unsupervised machine learning, across all included clinicians to create clusters (or phenotypes) based on their prescribing rates for these 6 conditions. A scree plot was used to determine the optimal number of clusters.

Results: A total of 431 clinicians at 77 practices with 234,288 problem-focused visits were included (Table 1). Overall, 42,441 visits (18%) resulted in an antibiotic prescription. Individual clinician prescribing proportions ranged from 5% of visits up to 44%. The optimal number of clusters was determined to be four (designated alpha, beta, gamma, delta). Antibiotic prescribing rates were similar for each phenotype across OM, pharyngitis, and pneumonia but differed substantially for sinusitis, SSTI, and UTI (Figure 1). The beta phenotype had the highest median rates of prescribing across all conditions, while the delta phenotype had the lowest median prescribing rates except for UTI.

Table 1. Patient demographics and clinician characteristics

Figure 1. Novel phenotypes of antibiotic prescribing practices across six common conditions

Conclusion: Antibiotic prescribing varies by both condition and individual clinician. Clustering algorithms can be used to derive phenotypic antibiotic prescribing practices. Antimicrobial stewardship efforts may have a higher impact if tailored by antibiotic prescribing phenotype.

Disclosures: All Authors: No reported disclosures
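The feature construction described in the Methods (per-clinician prescribing proportions stratified by diagnosis, after excluding low-volume clinicians) might look like the sketch below. The visit records, clinician identifiers, and the tiny exclusion threshold are hypothetical. Treating a diagnosis a clinician never saw as a rate of 0.0 is also an assumption of this sketch, since the abstract does not say how such cases were handled.

```python
from collections import defaultdict

DIAGNOSES = ["OM", "pharyngitis", "PNA", "sinusitis", "SSTI", "UTI"]

def prescribing_features(visits, min_encounters):
    """Build one six-element rate vector per clinician.

    `visits` is an iterable of (clinician, diagnosis, got_antibiotic)
    tuples. Clinicians with fewer than `min_encounters` visits are
    dropped (the study used a threshold of 100).
    """
    encounters = defaultdict(int)
    seen = defaultdict(int)
    prescribed = defaultdict(int)
    for clinician, diagnosis, got_antibiotic in visits:
        encounters[clinician] += 1
        seen[(clinician, diagnosis)] += 1
        prescribed[(clinician, diagnosis)] += int(got_antibiotic)

    features = {}
    for clinician, total in encounters.items():
        if total < min_encounters:
            continue
        features[clinician] = [
            # Assumption: a diagnosis never seen contributes a rate of 0.0.
            prescribed[(clinician, dx)] / seen[(clinician, dx)]
            if seen[(clinician, dx)] else 0.0
            for dx in DIAGNOSES
        ]
    return features

# Hypothetical toy records; the threshold is lowered so the example is small.
visits = [
    ("c1", "OM", True),
    ("c1", "OM", False),
    ("c1", "UTI", True),
    ("c2", "OM", True),
]
features = prescribing_features(visits, min_encounters=2)
print(features)

# Sanity check on the reported overall rate: 42,441 of 234,288 visits.
overall_rate = 42_441 / 234_288
print(round(overall_rate, 3))
```

Vectors of this shape are what a clustering algorithm such as k-means would then group into prescribing phenotypes.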


A Pattern New in Every Moment: The Temporal Clustering of Markets for Crude Oil, Refined Fuels, and Other Commodities

Energies ◽

10.3390/en14196099 ◽

2021 ◽

Vol 14 (19) ◽

pp. 6099

Author(s):

James Ming Chen ◽

Mobeen Ur Rehman

Keyword(s):

Machine Learning ◽

Crude Oil ◽

Temporal Dynamics ◽

Commodity Markets ◽

Conditional Volatility ◽

Critical Periods ◽

Clustering Methods ◽

Unsupervised Machine Learning ◽

Temporal Clustering ◽

September 11 Terrorist Attacks

The identification of critical periods and business cycles contributes significantly to the analysis of financial markets and the macroeconomy. Financialization and cointegration place a premium on the accurate recognition of time-varying volatility in commodity markets, especially those for crude oil and refined fuels. This article seeks to identify critical periods in the trading of energy-related commodities as a step toward understanding the temporal dynamics of those markets. This article proposes a novel application of unsupervised machine learning. A suite of clustering methods, applied to conditional volatility forecasts by trading days and individual assets or asset classes, can identify critical periods in energy-related commodity markets. Unsupervised machine learning achieves this task without rules-based or subjective definitions of crises. Five clustering methods (affinity propagation, mean-shift, spectral, k-means, and hierarchical agglomerative clustering) can identify anomalous periods in commodities trading. These methods identified the financial crisis of 2008–2009 and the initial stages of the COVID-19 pandemic. Applied to four energy-related markets (Brent, West Texas Intermediate, gasoil, and gasoline), the same methods identified additional periods connected to events such as the September 11 terrorist attacks and the 2003 Persian Gulf war. t-distributed stochastic neighbor embedding facilitates the visualization of trading regimes. Temporal clustering of conditional volatility forecasts reveals unusual financial properties that distinguish the trading of energy-related commodities during critical periods from trading during normal periods and from trade in other commodities in all periods. Whereas critical periods for all commodities appear to coincide with broader disruptions in demand for energy, critical periods unique to crude oil and refined fuels appear to arise from acute disruptions in supply. Extensions of these methods include the definition of bull and bear markets and the identification of recessions and recoveries in the real economy.


Characterisation of Temporal Patterns in Step Count Behaviour from Smartphone App Data: An Unsupervised Machine Learning Approach

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph182111476 ◽

2021 ◽

Vol 18 (21) ◽

pp. 11476

Author(s):

Francesca Pontin ◽

Nik Lomax ◽

Graham Clarke ◽

Michelle A. Morris

Keyword(s):

Physical Activity ◽

Machine Learning ◽

Activity Patterns ◽

Physical Activity Behaviour ◽

Temporal Variations ◽

Habitual Physical Activity ◽

Seasonal Activity ◽

Clustering Methods ◽

Unsupervised Machine Learning ◽

Activity Behaviour

The increasing ubiquity of smartphone data, with greater spatial and temporal coverage than achieved by traditional study designs, has the potential to provide insight into habitual physical activity patterns. This study implements and evaluates the utility of both K-means clustering and agglomerative hierarchical clustering methods in identifying weekly and yearlong physical activity behaviour trends. It also characterises the demographics and choice of activity type within the identified clusters of behaviour. Across all seven clusters of seasonal activity behaviour identified, daylight saving was shown to play a key role in influencing behaviour, with increased activity in summer months. Investigation into weekly behaviours identified six clusters in which weekdays and weekends played varied roles in the likelihood of meeting physical activity guidelines. Preferred type of physical activity likewise varied between clusters, with gender and age strongly associated with cluster membership. Key relationships are identified between weekly clusters and seasonal activity behaviour clusters, demonstrating how short-term behaviours contribute to longer-term activity patterns. Utilising unsupervised machine learning, this study demonstrates how the volume and richness of secondary app data can allow us to move away from aggregate measures of physical activity to better understand temporal variations in habitual physical activity behaviour.


An On-Line Agglomerative Clustering Method for Nonstationary Data

Neural Computation ◽

10.1162/089976699300016755 ◽

1999 ◽

Vol 11 (2) ◽

pp. 521-540 ◽

Cited By ~ 41

Author(s):

Isaac David Guedalia ◽

Mickey London ◽

Michael Werman

Keyword(s):

Clustering Algorithm ◽

Small Mass ◽

Good Representation ◽

Clustering Methods ◽

Agglomerative Clustering ◽

Number Of Clusters ◽

Local Distortion ◽

On Line ◽

Nonstationary Data ◽

Computationally Intensive

An on-line agglomerative clustering algorithm for nonstationary data is described. Three issues are addressed. The first regards the temporal aspects of the data. The clustering of stationary data by the proposed algorithm is comparable to the other popular algorithms tested (batch and on-line). The second issue addressed is the number of clusters required to represent the data. The algorithm provides an efficient framework to determine the natural number of clusters given the scale of the problem. Finally, the proposed algorithm implicitly minimizes the local distortion, a measure that takes into account clusters with relatively small mass. In contrast, most existing on-line clustering methods assume stationarity of the data. When used to cluster nonstationary data, these methods fail to generate a good representation. Moreover, most current algorithms are computationally intensive when determining the correct number of clusters. These algorithms tend to neglect clusters of small mass due to their minimization of the global distortion (Energy).


Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition

10.26434/chemrxiv.5513581.v1 ◽

2017 ◽

Author(s):

Sabrina Jaeger ◽

Simone Fulle ◽

Samo Turk

Keyword(s):

Machine Learning ◽

Language Processing ◽

Supervised Machine Learning ◽

Learning Approach ◽

Learning Approaches ◽

Unsupervised Machine Learning ◽

Feature Representations ◽

Machine Learning Approach ◽

The Individual ◽

Vector Representations

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similar to Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up the vectors of the individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as the reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment independent and can thus also be easily used for proteins with low sequence similarities.
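The encoding step described above, in which a compound vector is the sum of its substructure vectors, can be sketched in a few lines. The substructure identifiers and the three-dimensional embedding values below are hypothetical placeholders, not output of a trained Mol2vec model, and the helper names are invented for this illustration.

```python
import math

def compound_vector(substructure_ids, embeddings):
    """Encode a compound as the element-wise sum of the vectors of its
    substructures (one vector per occurrence)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for sid in substructure_ids:
        total = [t + v for t, v in zip(total, embeddings[sid])]
    return total

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical substructure embeddings (real ones are learned and much
# higher-dimensional).
embeddings = {
    "sub_a": [1.0, 0.0, 2.0],
    "sub_b": [0.5, 1.0, 0.0],
    "sub_c": [2.0, 0.0, 4.0],  # points in the same direction as sub_a
}

compound = compound_vector(["sub_a", "sub_b"], embeddings)
similarity = cosine_similarity(embeddings["sub_a"], embeddings["sub_c"])
print(compound, similarity)
```

The resulting fixed-length vector can then serve as the input features of a supervised model, and the cosine similarity illustrates what "pointing in similar directions" means for related substructures.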


Analysis of the Bath Motion in the MM-SQC Dynamics Using Unsupervised Machine Learning Dimensionality Reduction Approaches: Principal Component Analysis

10.26434/chemrxiv.13332530 ◽

2020 ◽

Author(s):

Jiawei Peng ◽

Yu Xie ◽

Deping Hu ◽

Zhenggang Lan

Keyword(s):

Machine Learning ◽

Principal Component Analysis ◽

Collective Motion ◽

Principal Component ◽

Component Analysis ◽

Nonadiabatic Dynamics ◽

Trajectory Data ◽

Unsupervised Machine Learning ◽

Physical Knowledge ◽

Vibronic Couplings

The system-plus-bath model is an important tool to understand nonadiabatic dynamics for large molecular systems. Understanding the collective motion of a huge number of bath modes is essential to reveal their key roles in the overall dynamics. We apply principal component analysis (PCA) to investigate the bath motion based on the massive data generated from the MM-SQC (symmetrical quasi-classical dynamics method based on the Meyer-Miller mapping Hamiltonian) nonadiabatic dynamics of the excited-state energy transfer of the Frenkel-exciton model. The PCA method clearly shows that two types of bath modes, those that display strong vibronic couplings and those with frequencies close to the electronic transition, are very important to the nonadiabatic dynamics. These observations are fully consistent with physical insight. This conclusion is obtained purely from the PCA of the trajectory data, without substantial involvement of pre-defined physical knowledge. The results show that the PCA approach, one of the simplest unsupervised machine learning methods, is very powerful for analyzing complicated nonadiabatic dynamics in the condensed phase involving many degrees of freedom.
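As a rough illustration of how PCA singles out a dominant collective motion, the sketch below finds the first principal component of a small data set by power iteration on the covariance matrix. The three-column "trajectory" is hypothetical toy data, not MM-SQC output, and power iteration is simply one convenient way to obtain the leading eigenvector without external libraries.

```python
import math

def first_principal_component(rows, iters=200):
    """Return (direction, explained_variance_ratio) of the leading
    principal component, computed by power iteration.

    Caveat: the all-ones start vector must not be orthogonal to the
    leading eigenvector; that holds for the toy data below.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance along the direction v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

# Toy snapshots of three coordinates: the first two move together
# (second = 2 * first), the third only jitters slightly.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
zs = [0.1, -0.1, 0.0, -0.1, 0.1]
rows = [(x, 2.0 * x, z) for x, z in zip(xs, zs)]

direction, explained = first_principal_component(rows)
print([round(c, 4) for c in direction], round(explained, 4))
```

Here the leading component lies along the correlated pair of coordinates and carries almost all of the variance, while the jittering coordinate contributes essentially nothing, which is the kind of separation the abstract attributes to the important bath modes.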


Gender and age structure as well as body weight of partridge (Perdix perdix L.) during periods of high and low population density in the Lublin Upland

Annals of Warsaw University of Life Sciences - SGGW - Animal Science ◽

10.22630/aas.2017.56.1.8 ◽

2017 ◽

Vol 56 (1) ◽

pp. 65-74

Author(s):

MARIAN FLIS ◽

MAREK PANEK

Keyword(s):

Body Weight ◽

Population Density ◽

Age Structure ◽

Perdix Perdix ◽

Gender And Age


An analysis of COVID-19 clusters in India

BMC Public Health ◽

10.1186/s12889-021-10491-8 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Pooja Sengupta ◽

Bhaswati Ganguli ◽

Sugata SenRoy ◽

Aditya Chatterjee

Keyword(s):

Public Health ◽

Cluster Analysis ◽

Population Density ◽

Case Studies ◽

Compartment Model ◽

Health Interventions ◽

Disease Spread ◽

Second Phase ◽

Public Health Interventions ◽

Homogeneous Groups

Abstract

Background: In this study we cluster the districts of India in terms of the spread of COVID-19 and related variables such as population density and the number of specialty hospitals. Simulation using a compartment model is used to provide insight into differences in response to public health interventions. Two case studies of interest, from Nizamuddin and Dharavi, provide contrasting pictures of the success in curbing spread.

Methods: A cluster analysis of the worst affected districts in India provides insight into the similarities between them. The effects of public health interventions in flattening the curve in their respective states are studied using the individual contact SEIQHRF model, a stochastic individual compartment model which simulates disease prevalence in the susceptible, infected, recovered and fatal compartments.

Results: The clustering of hotspot districts provides homogeneous groups that can be discriminated in terms of number of cases and related covariates. The cluster analysis reveals that the distribution of the number of COVID-19 hospitals in the districts does not correlate with the distribution of confirmed COVID-19 cases. From the SEIQHRF model for Nizamuddin we observe that, in the second phase, the number of infected individuals saw a manifold increase in the states to which Nizamuddin attendees returned, increasing the risk of disease spread. However, the simulations reveal that implementing administrative interventions flattens the curve. In Dharavi, through tracing, tracking, testing and treating, a massive outbreak of COVID-19 was brought under control.

Conclusions: The cluster analysis performed on the districts reveals homogeneous groups of districts that can be ranked based on the burden placed on the healthcare system in terms of number of confirmed cases, population density and number of hospitals dedicated to COVID-19 treatment. The study concludes with two important case studies on Nizamuddin basti and Dharavi to illustrate the growth curve of COVID-19 in two very densely populated regions in India. In the case of Nizamuddin, the study showed that there was a manifold increase in the risk of infection. In contrast, there was a rapid decline in the number of cases in Dharavi within a span of about one month.
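The idea that interventions "flatten the curve" can be shown with a much simpler model than the one used in the study. The sketch below is a deterministic, discrete-time SIR model, not the stochastic individual-contact SEIQHRF model of the abstract, and every parameter value (population, contact rates, recovery rate) is an assumption picked for illustration.

```python
def simulate_sir(population, initial_infected, contact_rate,
                 recovery_rate, days):
    """Daily-step SIR model; returns the number infected on each day."""
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    infected_curve = [infected]
    for _ in range(days):
        new_infections = contact_rate * susceptible * infected / population
        new_recoveries = recovery_rate * infected
        susceptible -= new_infections
        infected += new_infections - new_recoveries
        infected_curve.append(infected)
    return infected_curve

# Hypothetical scenario: an intervention halves the effective contact rate.
baseline = simulate_sir(10_000, 10, contact_rate=0.30,
                        recovery_rate=0.10, days=300)
intervention = simulate_sir(10_000, 10, contact_rate=0.15,
                            recovery_rate=0.10, days=300)

peak_baseline = max(baseline)
peak_intervention = max(intervention)
print(round(peak_baseline), baseline.index(peak_baseline))
print(round(peak_intervention), intervention.index(peak_intervention))
```

With the lower contact rate, the peak number of simultaneously infected individuals is smaller and occurs later, which is the qualitative effect described in the Results. The quantitative behaviour of the SEIQHRF model, with its additional compartments and stochastic contacts, will differ.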


An unsupervised machine-learning checkpoint-restart algorithm using Gaussian mixtures for particle-in-cell simulations

Journal of Computational Physics ◽

10.1016/j.jcp.2021.110185 ◽

2021 ◽

Vol 436 ◽

pp. 110185

Author(s):

G. Chen ◽

L. Chacón ◽

T.B. Nguyen

Keyword(s):

Machine Learning ◽

Gaussian Mixtures ◽

Unsupervised Machine Learning ◽

Particle In Cell
