A machine learning approach to medical data identification through principal component analysis

Author(s):  
Lorenzo E. Jaques ◽  
Arthur C. Depoian ◽  
Dong Xie ◽  
Colleen P. Bailey ◽  
Parthasarathy Guturu
2021 ◽  
Vol 8 (Supplement_1) ◽  
pp. S331-S332
Author(s):  
Nusrat J Epsi ◽  
John H Powers ◽  
David A Lindholm ◽  
Alison Helfrich ◽  
...  

Abstract

Background: The novel coronavirus disease 2019 (COVID-19) pandemic remains a global challenge. Accurate COVID-19 prognosis remains an important aspect of clinical management. While many prognostic systems have been proposed, most are derived from analyses of individual symptoms or biomarkers. Here, we take a machine learning approach to first identify discrete clusters of early-stage symptoms that may delineate groups with distinct symptom phenotypes. We then examine whether these groups correlate with subsequent disease severity.

Methods: The Epidemiology, Immunology, and Clinical Characteristics of Emerging Infectious Diseases with Pandemic Potential (EPICC) study is a longitudinal cohort study with data and biospecimens collected from nine military treatment facilities over 1 year of follow-up. Demographic and clinical characteristics were measured with interviews and electronic medical record review. Early symptoms by organ domain were measured by FLU-PRO-plus surveys collected for 14 days post-enrollment, with surveys completed a median of 14.5 days (interquartile range, IQR = 13) after symptom onset. Using these FLU-PRO-plus responses, we applied principal component analysis followed by the unsupervised machine learning algorithm k-means to identify groups with distinct clusters of symptoms. We then fit multivariate logistic regression models to determine how these early-symptom clusters correlated with hospitalization risk after controlling for age, sex, race, and obesity.

Results: Using SARS-CoV-2 positive participants (n = 1137) from the EPICC cohort (Figure 1), we transformed reported symptoms into domains and identified three groups of participants with distinct clusters of symptoms. Logistic regression demonstrated that cluster 2 was associated with an approximately three-fold increase in the odds of hospitalization [3.01 (95% CI: 2-4.52); P < 0.001], which remained significant after controlling for other factors [2.97 (95% CI: 1.88-4.69); P < 0.001].

Figure 1: (A) Baseline characteristics of SARS-CoV-2 positive participants. (B) Heatmap comparing FLU-PRO response in each participant. (C) Principal component analysis followed by k-means clustering identified three groups of participants. (D) Crude and adjusted association of identified cluster with hospitalization.

Conclusion: Our findings identify three distinct groups with early-symptom phenotypes. With further validation of the clusters' significance, this tool could be used to improve COVID-19 prognosis in a precision medicine framework and may assist in patient triaging and clinical decision-making.

Disclosures: David A. Lindholm, MD, American Board of Internal Medicine (Individual(s) Involved: Self): Member of Auxiliary R&D Infectious Disease Item-Writer Task Force. No financial support received. No exam questions will be disclosed (Other Financial or Material Support). Ryan C. Maves, MD, EMD Serono (Advisor or Review Panel member), Heron Therapeutics (Advisor or Review Panel member). Simon Pollett, MBBS, AstraZeneca (Other Financial or Material Support, HJF, in support of USU IDCRP, funded under a CRADA to augment the conduct of an unrelated Phase III COVID-19 vaccine trial sponsored by AstraZeneca as part of USG response (unrelated work)).
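The pipeline described in the Methods (principal component analysis followed by k-means) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `pca_scores` and `kmeans`, the choice of two components, and the synthetic "domain score" matrix are all assumptions made for the example, and no study data are used.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(Z, k, n_iter=100, seed=0):
    """Plain Lloyd-style k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every center
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic stand-in for symptom-domain scores (hypothetical, NOT study data):
# 150 participants, 6 domains, three underlying groups.
rng = np.random.default_rng(42)
group_means = np.array([[0, 0, 0, 0, 0, 0],
                        [3, 3, 0, 0, 1, 1],
                        [0, 1, 4, 4, 0, 2]], dtype=float)
X = np.vstack([m + rng.normal(scale=0.5, size=(50, 6)) for m in group_means])

scores = pca_scores(X, n_components=2)   # reduce to two components
labels, centers = kmeans(scores, k=3)    # cluster in component space
print(np.bincount(labels, minlength=3))  # cluster sizes
```

In practice the cluster labels would then enter a logistic regression as a categorical predictor alongside the covariates named in the abstract; that step is omitted here.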


Author(s):  
Hyeuk Kim

Unsupervised learning in machine learning divides data into several groups. Observations in the same group have similar characteristics, while observations in different groups have different characteristics. In this paper, we cluster data using partitioning around medoids, which has some advantages over k-means clustering. We apply it to baseball players in the Korea Baseball League. We also apply principal component analysis to the data and plot the clusters using the first two components as axes. Through this procedure, we interpret the meaning of the clustering graphically. The combination of partitioning around medoids and principal component analysis can be applied to other data, and the approach makes the characteristics of each cluster easy to identify.
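To make the medoid idea concrete, here is a small sketch of a k-medoids procedure. It uses the simple alternating scheme (assign points, then re-pick each medoid) rather than the full swap-based partitioning around medoids algorithm, so it should be read as an approximation of the technique named in the abstract. The function name `k_medoids` and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Alternating k-medoids; returns (medoid indices, labels)."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances between all observations
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                # the member with the smallest total distance to the others
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

# Hypothetical two-group data standing in for player statistics.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 4)),
               rng.normal(5.0, 1.0, size=(30, 4))])
medoids, labels = k_medoids(X, k=2)
print(medoids, np.bincount(labels, minlength=2))
```

Unlike a k-means centroid, each medoid is an actual observation, which is one reason the method is less sensitive to outliers and easier to interpret.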


2020 ◽  
Author(s):  
Jiawei Peng ◽  
Yu Xie ◽  
Deping Hu ◽  
Zhenggang Lan

The system-plus-bath model is an important tool for understanding nonadiabatic dynamics in large molecular systems. Understanding the collective motion of a huge number of bath modes is essential to reveal their key roles in the overall dynamics. We apply principal component analysis (PCA) to investigate the bath motion based on the massive data generated from MM-SQC (the symmetrical quasi-classical dynamics method based on the Meyer-Miller mapping Hamiltonian) nonadiabatic dynamics simulations of excited-state energy transfer in a Frenkel-exciton model. The PCA results show that two types of bath modes, those with strong vibronic couplings and those with frequencies close to the electronic transition, are very important to the nonadiabatic dynamics. These observations are fully consistent with physical insight. This conclusion is obtained purely from the PCA of the trajectory data, without substantial reliance on predefined physical knowledge. The results show that the PCA approach, one of the simplest unsupervised machine learning methods, is very powerful for analyzing complicated nonadiabatic dynamics in the condensed phase involving many degrees of freedom.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
J. A. Camilleri ◽  
S. B. Eickhoff ◽  
S. Weis ◽  
J. Chen ◽  
J. Amunts ◽  
...  

Abstract While a replicability crisis has shaken psychological sciences, the replicability of multivariate approaches for psychometric data factorization has received little attention. In particular, Exploratory Factor Analysis (EFA) is frequently promoted as the gold standard in psychological sciences. However, the application of EFA to executive functioning, a core concept in psychology and cognitive neuroscience, has led to divergent conceptual models. This heterogeneity severely limits the generalizability and replicability of findings. To tackle this issue, in this study, we propose to capitalize on a machine learning approach, OPNMF (Orthonormal Projective Non-Negative Factorization), and leverage internal cross-validation to promote generalizability to an independent dataset. We examined its application to the scores of 334 adults on the Delis–Kaplan Executive Function System (D-KEFS), comparing it to standard EFA and Principal Component Analysis (PCA). We further evaluated the replicability of the derived factorization across specific gender and age subsamples. Overall, OPNMF and PCA both converge towards a two-factor model as the best data-fit model. The derived factorization suggests a division between low-level and high-level executive functioning measures, a model further supported in subsamples. In contrast, EFA highlighted a five-factor model, which reflects the segregation of the D-KEFS battery into its main tasks while still clustering higher-level tasks together. However, this model was poorly supported in the subsamples. Thus, the parsimonious two-factor model revealed by OPNMF encompasses the more complex factorization yielded by EFA while enjoying higher generalizability. Hence, OPNMF provides a conceptually meaningful, technically robust, and generalizable factorization for psychometric tools.


2020 ◽  
Author(s):  
Renata Silva ◽  
Daniel Oliveira ◽  
Davi Pereira Santos ◽  
Lucio F.D. Santos ◽  
Rodrigo Erthal Wilson ◽  
...  

Principal component analysis (PCA) is an efficient model for the optimization problem of finding d' axes of a subspace R^{d'} ⊆ R^d such that the mean squared distance from a given set R of points to the axes is minimal. Steadily employed since 1901 in different scenarios, e.g., mechanics, PCA has become an important link in chained machine learning tasks, such as feature learning and AutoML designs. A frequent yet open issue in supervised problems is how many PCA axes are required to tune the performance of machine learning constructs. Accordingly, we investigate the behavior of six independent and uncoupled criteria for estimating the number of PCA axes, namely Scree-Plot %, Scree-Plot Gap, Kaiser-Guttman, Broken-Stick, p-Score, and 2D. In total, we evaluate the performance of those approaches on 20 high-dimensional datasets by using (i) four different classifiers, and (ii) a hypothesis test on the reported F-Measures. Results indicate that the Broken-Stick and Scree-Plot % criteria consistently outperformed the competitors in supervised tasks, whereas the Kaiser-Guttman and Scree-Plot Gap estimators delivered poor performance in the same scenarios.
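Three of the criteria named above have widely used textbook definitions, sketched below on the eigenvalues of a correlation matrix. The exact variants and thresholds used in the paper are not given in the abstract, so the function names, the 90% default for the scree-percentage rule, and the synthetic data are assumptions; the p-Score and 2D criteria are not covered.

```python
import numpy as np

def correlation_eigenvalues(X):
    """Eigenvalues of the correlation matrix of X, largest first."""
    R = np.corrcoef(X, rowvar=False)
    return np.sort(np.linalg.eigvalsh(R))[::-1]

def kaiser_guttman(eigvals):
    """Keep components whose eigenvalue exceeds the mean eigenvalue."""
    return int(np.sum(eigvals > eigvals.mean()))

def broken_stick(eigvals):
    """Keep leading components whose variance share beats the broken-stick share."""
    p = len(eigvals)
    expected = np.array([np.sum(1.0 / np.arange(k, p + 1)) / p
                         for k in range(1, p + 1)])
    observed = eigvals / eigvals.sum()
    keep = 0
    for obs, exp in zip(observed, expected):
        if obs <= exp:
            break
        keep += 1
    return keep

def scree_percent(eigvals, threshold=0.9):
    """Smallest number of components whose cumulative share reaches threshold."""
    cum = np.cumsum(eigvals) / eigvals.sum()
    return min(int(np.searchsorted(cum, threshold)) + 1, len(eigvals))

# Synthetic data: 2 latent factors spread over 8 observed variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 8)) + 0.3 * rng.normal(size=(200, 8))
ev = correlation_eigenvalues(X)
print(kaiser_guttman(ev), broken_stick(ev), scree_percent(ev))
```

The three rules can disagree on the same spectrum, which is the kind of behavior the comparison in the abstract is designed to measure.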


2020 ◽  
Vol 36 (17) ◽  
pp. 4590-4598
Author(s):  
Robert Page ◽  
Ruriko Yoshida ◽  
Leon Zhang

Abstract

Motivation: Due to new technology for efficiently generating genome data, machine learning methods are urgently needed to analyze large sets of gene trees over the space of phylogenetic trees. However, the space of phylogenetic trees is not Euclidean, so ordinary machine learning methods cannot be directly applied. In 2019, Yoshida et al. introduced the notion of tropical principal component analysis (PCA), a statistical method for visualization and dimensionality reduction using a tropical polytope with a fixed number of vertices that minimizes the sum of tropical distances between each data point and its tropical projection. However, their work focused on the tropical projective space rather than the space of phylogenetic trees. We focus here on tropical PCA for dimension reduction and visualization over the space of phylogenetic trees.

Results: Our main results are 2-fold: (i) theoretical interpretations of the tropical principal components over the space of phylogenetic trees, namely, the existence of a tropical cell decomposition into regions of fixed tree topology; and (ii) the development of a stochastic optimization method to estimate tropical PCs over the space of phylogenetic trees using a Markov Chain Monte Carlo approach. This method performs well with simulation studies, and it is applied to three empirical datasets: Apicomplexa and African coelacanth genomes as well as sequences of hemagglutinin for influenza from New York.

Availability and implementation: Dataset: http://polytopes.net/Data.tar.gz. Code: http://polytopes.net/tropica_MCMC_codes.tar.gz.

Supplementary information: Supplementary data are available at Bioinformatics online.
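The objective described in the Motivation (the sum of tropical distances between each data point and its projection onto a tropical polytope) can be illustrated with a short sketch. It assumes the max-plus convention, with the tropical distance taken as max(u - v) - min(u - v) and the projection built from the coefficients λ_k = min_i(u_i - v_{k,i}); the paper's own conventions may differ, and the function names and toy vertices are hypothetical. The MCMC search over vertices is not shown.

```python
import numpy as np

def tropical_distance(u, v):
    """Tropical metric on the tropical projective torus: max(u - v) - min(u - v)."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(diff.max() - diff.min())

def tropical_projection(u, vertices):
    """Project u onto the max-plus tropical convex hull of the given vertices."""
    u = np.asarray(u, dtype=float)
    V = np.asarray(vertices, dtype=float)
    lam = (u - V).min(axis=1)                # lambda_k = min_i (u_i - v_{k,i})
    return (lam[:, None] + V).max(axis=0)    # coordinate-wise max over k

def projection_cost(points, vertices):
    """Sum of tropical distances from each point to its projection."""
    return sum(tropical_distance(p, tropical_projection(p, vertices))
               for p in points)

# Toy example: a tropical triangle (three vertices) in three coordinates.
vertices = [[0.0, 0.0, 0.0], [0.0, 3.0, 1.0], [0.0, 2.0, 5.0]]
points = [[0.0, 1.0, 4.0], [0.0, 5.0, 2.0], [0.0, 3.0, 1.0]]
print(projection_cost(points, vertices))
```

A stochastic search such as the one the abstract describes would propose new vertex sets and compare this cost between proposals.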

