Conditional t-SNE: more informative t-SNE embeddings

2020 ◽  
Author(s):  
Bo Kang ◽  
Darío García García ◽  
Jefrey Lijffijt ◽  
Raúl Santos-Rodríguez ◽  
Tijl De Bie

Abstract: Dimensionality reduction and manifold learning methods such as t-distributed stochastic neighbor embedding (t-SNE) are frequently used to map high-dimensional data into a two-dimensional space to visualize and explore that data. Going beyond the specifics of t-SNE, there are two substantial limitations of any such approach: (1) not all information can be captured in a single two-dimensional embedding, and (2) to well-informed users, the salient structure of such an embedding is often already known, preventing any real new insights from being obtained. Currently, it is not known how to extract the remaining information in a similarly effective manner. We introduce conditional t-SNE (ct-SNE), a generalization of t-SNE that discounts prior information in the form of labels. This enables obtaining more informative and more relevant embeddings. To achieve this, we propose a conditioned version of the t-SNE objective, obtaining an elegant method with a single integrated objective. We show how to efficiently optimize the objective and study the effects of the extra parameter that ct-SNE has over t-SNE. Qualitative and quantitative empirical results on synthetic and real data show that ct-SNE is scalable, effective, and achieves its goal: it allows complementary structure to be captured in the embedding and provides new insights into real data.
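The key modification in ct-SNE is to the high-dimensional neighborhood distribution: affinities between points that share a known label are discounted, so the embedding is free to express structure beyond the labels. The NumPy sketch below conveys only that discounting idea; the function name, the single global bandwidth `sigma`, and the multiplicative factor `alpha` are illustrative assumptions, not the paper's actual conditioned objective.

```python
import numpy as np

def conditional_affinities(X, labels, sigma=1.0, alpha=0.1):
    """Sketch of a label-discounted affinity matrix in the spirit of ct-SNE:
    plain Gaussian affinities, with pairs sharing a known label down-weighted
    by a prior factor alpha (alpha < 1 discounts known structure)."""
    D = np.square(X[:, None, :] - X[None, :, :]).sum(-1)   # pairwise squared distances
    P = np.exp(-D / (2 * sigma**2))                        # Gaussian kernel
    np.fill_diagonal(P, 0.0)                               # no self-affinity
    same = labels[:, None] == labels[None, :]              # known-label pairs
    P = np.where(same, alpha * P, P)                       # discount known pairs
    return P / P.sum()                                     # normalize to a distribution
```

With `alpha < 1`, same-label neighbors contribute less to the target distribution, which is the intuition behind why the resulting embedding can surface complementary structure.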

2006 ◽  
Vol 21 (22) ◽  
pp. 4511-4518
Author(s):  
J. DOUARI

We construct a set of noncommuting translation operators in two- and higher-dimensional lattices. The algebras they close are w∞-algebras. The construction is based on the introduction of noncommuting elementary link operators, which link two neighboring sites in the lattice. This type of operator preserves the braiding nature of exotic particles, which live essentially in two-dimensional space.


2020 ◽  
Author(s):  
Timothy Kunz ◽  
Lila Rieber ◽  
Shaun Mahony

Abstract: Few existing methods enable the visualization of relationships between regulatory genomic activities and genome organization as captured by Hi-C experimental data. Genome-wide Hi-C datasets are often displayed using “heatmap” matrices, but it is difficult to intuit from these heatmaps which biochemical activities are compartmentalized together. High-dimensional Hi-C data vectors can alternatively be projected onto three-dimensional space using dimensionality reduction techniques. The resulting three-dimensional structures can serve as scaffolds for projecting other forms of genomic information, thereby enabling the exploration of relationships between genome organization and various genome annotations. However, while three-dimensional models are contextually appropriate for chromatin interaction data, some analyses and visualizations may be more intuitively and conveniently performed in two-dimensional space.
We present a novel approach to the visualization and analysis of chromatin organization based on the Self-Organizing Map (SOM). The SOM algorithm provides a two-dimensional manifold which adapts to represent the high-dimensional chromatin interaction space. The resulting data structure can then be used to assess the relationships between regulatory genomic activities and chromatin interactions. For example, given a set of genomic coordinates corresponding to a given biochemical activity, the degree to which this activity is segregated or compartmentalized in chromatin interaction space can be intuitively visualized on the 2D SOM grid and quantified using Lorenz curve analysis. We demonstrate our approach for exploratory analysis of genome compartmentalization in a high-resolution Hi-C dataset from the human GM12878 cell line. Our SOM-based approach provides an intuitive visualization of the large-scale structure of Hi-C data and serves as a platform for integrative analyses of the relationships between various genomic activities and genome organization.
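The SOM at the core of this approach can be sketched in a few lines of NumPy. This is a generic, minimal SOM trainer (grid size, decay schedules, and hyperparameters are arbitrary illustrative choices, not the paper's configuration); each row of `X` would stand in for one high-dimensional Hi-C interaction vector:

```python
import numpy as np

def train_som(X, grid=(8, 8), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal Self-Organizing Map: a 2D grid of units adapts to
    represent the high-dimensional input space (rows of X)."""
    rng = np.random.default_rng(seed)
    h, w = grid
    W = rng.normal(size=(h * w, X.shape[1]))               # unit weight vectors
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                        # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 0.5            # shrinking neighborhood
        for x in rng.permutation(X):
            bmu = np.argmin(((W - x) ** 2).sum(1))         # best-matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(1)      # grid distance to the BMU
            nbh = np.exp(-d2 / (2 * sigma**2))             # neighborhood kernel
            W += lr * nbh[:, None] * (x - W)               # pull units toward x
    return W.reshape(h, w, -1)
```

After training, genomic annotations can be projected onto the 2D grid by mapping each locus to its best-matching unit, which is the kind of visualization the abstract describes.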


2021 ◽  
Author(s):  
David Chushig-Muzo ◽  
Cristina Soguero-Ruiz ◽  
Pablo de Miguel Bohoyo ◽  
Inmaculada Mora-Jiménez

Abstract Background: Nowadays, patients with chronic diseases such as diabetes and hypertension have reached alarming numbers worldwide. These diseases increase the risk of developing acute complications and involve a substantial economic burden and demand for health resources. The widespread adoption of Electronic Health Records (EHRs) is opening great opportunities for supporting decision-making. Nevertheless, data extracted from EHRs are complex (heterogeneous, high-dimensional and usually noisy), hampering knowledge extraction with conventional approaches. Methods: We propose the use of the Denoising Autoencoder (DAE), a Machine Learning (ML) technique that transforms high-dimensional data into latent representations (LRs), thus addressing the main challenges of clinical data. We explore in this work how the combination of LRs with a visualization method can be used to map patient data into a two-dimensional space, gaining knowledge about the distribution of patients with different chronic conditions. Furthermore, this representation can also be used to characterize the evolution of a patient's health status, which is of paramount importance in the clinical setting. Results: To obtain clinical LRs, we considered real-world data extracted from EHRs linked to the University Hospital of Fuenlabrada in Spain. Experimental results showed the great potential of DAEs to identify patients with clinical patterns linked to hypertension, diabetes and multimorbidity. The procedure allowed us to find patients with the same main chronic disease but different clinical characteristics. Thus, we identified two kinds of diabetic patients with differences in their drug therapy (insulin and non-insulin dependent), and also a group of women affected by hypertension and gestational diabetes.
We also present a proof of concept for mapping the health status evolution of synthetic patients when considering the most significant diagnoses and drugs associated with chronic patients. Conclusions: Our results highlight the value of ML techniques to extract clinical knowledge, supporting the identification of patients with certain chronic conditions. Furthermore, the progression of a patient's health status in the two-dimensional space might be used as a tool for clinicians aiming to characterize health conditions and identify their most relevant clinical codes.
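A denoising autoencoder of the kind described can be sketched generically as follows. This is a single-hidden-layer, tied-weight toy version in plain NumPy; the architecture, noise level, and learning rate are illustrative assumptions and not the paper's actual model. Rows of `X` would be patient feature vectors, and the returned matrix is their latent representation:

```python
import numpy as np

def train_dae(X, n_latent=2, noise=0.2, lr=0.1, epochs=200, seed=0):
    """Minimal denoising autoencoder: corrupt the input with Gaussian noise,
    learn to reconstruct the clean input, and return the hidden activations
    as the latent representation (LR)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, n_latent))   # tied encoder/decoder weights
    b, c = np.zeros(n_latent), np.zeros(d)
    for _ in range(epochs):
        Xn = X + noise * rng.normal(size=X.shape)   # corrupt the input
        H = np.tanh(Xn @ W + b)                     # encode
        R = H @ W.T + c                             # decode (linear output)
        E = R - X                                   # reconstruction error vs. clean X
        dH = (E @ W) * (1 - H**2)                   # backprop through tanh
        W -= lr / n * (Xn.T @ dH + E.T @ H)         # gradient for tied weights
        b -= lr / n * dH.sum(0)
        c -= lr / n * E.sum(0)
    return np.tanh(X @ W + b)                       # latent representation of clean X
```

With `n_latent=2` the latent representation can be plotted directly; in the paper's setting the LRs are combined with a separate visualization method to map patients into two dimensions.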


Author(s):  
Muhammad Amjad

Advances in manifold learning have proven to be of great benefit in reducing the dimensionality of large, complex datasets. Elements in an intricate dataset will typically live in high-dimensional space, as the number of individual features or independent variables will be extensive. However, these elements can be integrated into a low-dimensional manifold with well-defined parameters. By constructing a low-dimensional manifold and embedding it into high-dimensional feature space, the dataset can be simplified for easier interpretation. Despite this dimensionality reduction, the dataset's constituents do not lose any information; rather, the method filters it in the hope of elucidating the relevant knowledge. This paper explores the importance of this method of data analysis, its applications, and its extensions into topological data analysis.


2021 ◽  
Author(s):  
Kehinde Olobatuyi

Abstract: Similar to many machine learning models, both the accuracy and the speed of cluster-weighted models (CWMs) can be hampered by high-dimensional data, which has motivated previous work on parsimonious techniques to reduce the effect of the “curse of dimensionality” on mixture models. In this work, we review the background of cluster-weighted models (CWMs). We further show that a parsimonious technique alone is not sufficient for mixture models to thrive in the presence of very large, high-dimensional data. We discuss a heuristic for detecting the hidden components by choosing the initial values of the location parameters using the default values in the “FlexCWM” R package. We introduce a dimensionality reduction technique, t-distributed stochastic neighbor embedding (t-SNE), to enhance parsimonious CWMs in high-dimensional space. CWMs are originally suited for regression, but for classification purposes all multi-class variables are transformed logarithmically with some noise. The parameters of the model are obtained via the expectation-maximization (EM) algorithm. The effectiveness of the discussed technique is demonstrated using real data sets from different fields.
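The class-to-continuous transform mentioned above ("transformed logarithmically with some noise") might look like the following sketch. The exact transform, offset, and noise scale used by the author are not specified in the abstract, so all three are assumptions here:

```python
import numpy as np

def log_jitter(y, eps=0.05, seed=0):
    """Sketch of a class-to-continuous transform: map integer class labels
    to a continuous response via a log (with +1 offset so class 0 is valid)
    plus small Gaussian noise, so a regression-style CWM can be fit."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    return np.log(y + 1.0) + eps * rng.normal(size=y.shape)
```

The noise breaks ties between identical labels, giving each class a narrow continuous band that a regression mixture can model.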


2021 ◽  
Vol 15 ◽  
Author(s):  
Ashkan Faghiri ◽  
Eswar Damaraju ◽  
Aysenil Belger ◽  
Judith M. Ford ◽  
Daniel Mathalon ◽  
...  

Background: A number of studies in recent years have explored whole-brain dynamic connectivity using pairwise approaches. There has been less focus on trying to analyze brain dynamics in higher dimensions over time.
Methods: We introduce a new approach that analyzes time series trajectories to identify high traffic nodes in a high dimensional space. First, functional magnetic resonance imaging (fMRI) data are decomposed using spatial ICA to a set of maps and their associated time series. Next, density is calculated for each time point and high-density points are clustered to identify a small set of high traffic nodes. We validated our method using simulations and then implemented it on a real data set.
Results: We present a novel approach that captures dynamics within a high dimensional space and also does not use any windowing in contrast to many existing approaches. The approach enables one to characterize and study the time series in a potentially high dimensional space, rather than looking at each component pair separately. Our results show that schizophrenia patients have a lower dynamism compared to healthy controls. In addition, we find patients spend more time in nodes associated with the default mode network and less time in components strongly correlated with auditory and sensorimotor regions. Interestingly, we also found that subjects oscillate between state pairs that show opposite spatial maps, suggesting an oscillatory pattern.
Conclusion: Our proposed method provides a novel approach to analyze the data in its native high dimensional space and can possibly provide new information that is undetectable using other methods.
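The density step described in Methods can be sketched without any windowing as follows. Here `T` stands for the (time points × ICA components) matrix; the Gaussian kernel bandwidth and the fraction of points retained are illustrative assumptions, and the subsequent clustering of the retained points into nodes is omitted:

```python
import numpy as np

def high_traffic_candidates(T, sigma=1.0, top_frac=0.1):
    """Treat each fMRI time point as a point in ICA-component space,
    score its local density with a Gaussian kernel over all time points,
    and return the indices of the highest-density ("high traffic") points.
    The self-term contributes a constant offset of 1 to every score."""
    D = np.square(T[:, None, :] - T[None, :, :]).sum(-1)  # pairwise squared distances
    dens = np.exp(-D / (2 * sigma**2)).sum(1)             # kernel density score
    k = max(1, int(top_frac * len(T)))
    idx = np.argsort(dens)[-k:]                           # densest time points
    return idx, dens
```

Because every time point is scored against the full trajectory, no sliding window is needed, which matches the windowless character of the approach.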


2018 ◽  
Author(s):  
Deepesh Agarwal ◽  
Ryan T. Fellers ◽  
Bryan P. Early ◽  
Dan Lu ◽  
Caroline J. DeHart ◽  
...  

Post-translational modifications (PTMs) at multiple sites can collectively influence protein function but the scope of such PTM coding has been challenging to determine. The number of potential combinatorial patterns of PTMs on a single molecule increases exponentially with the number of modification sites and a population of molecules exhibits a distribution of such “modforms”. Estimating these “modform distributions” is central to understanding how PTMs influence protein function. Although mass spectrometry (MS) has made modforms more accessible, we have previously shown that current MS technology cannot recover the modform distribution of heavily modified proteins. However, MS data yield linear equations for modform amounts, which constrain the distribution within a high-dimensional, polyhedral “modform region”. Here, we show that linear programming (LP) can efficiently determine a range within which each modform value must lie, thereby approximating the modform region. We use this method on simulated data for mitogen-activated protein kinase 1 with the 7 phosphorylations reported on UniProt, giving a modform region in a 128-dimensional space. The exact dimension of the region is determined by the number of linearly independent equations but its size and shape depend on the data. The average modform range, which is a measure of size, decreases when data from bottom-up (BU) MS, in which proteins are first digested into peptides, are combined with data from top-down (TD) MS, in which whole proteins are analysed. Furthermore, when the modform distribution is structured, as might be expected of real distributions, the modform region for BU and TD combined has a more intricate polyhedral shape and is substantially more constrained than that of a random distribution. These results give the first insights into high-dimensional modform regions and confirm that fast LP methods can be used to analyse them.
We discuss the problems of using modform regions with real data, when the actual modform distribution will not be known.
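The LP bounding step can be sketched with SciPy's `linprog`: each MS measurement contributes one row of the equality system, and for every modform two LPs are solved to bound its possible value. The helper name and the toy dimensions are illustrative assumptions; real modform regions (such as the 128-dimensional example above) simply have a larger `A` and `x`:

```python
import numpy as np
from scipy.optimize import linprog

def modform_ranges(A, b):
    """For the polyhedron {x >= 0 : A @ x = b} defined by MS measurements,
    compute for each modform i the range [min x_i, max x_i] by solving
    two linear programs per coordinate."""
    n = A.shape[1]
    ranges = []
    for i in range(n):
        c = np.zeros(n)
        c[i] = 1.0
        lo = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * n, method="highs").fun
        hi = -linprog(-c, A_eq=A, b_eq=b, bounds=[(0, None)] * n, method="highs").fun
        ranges.append((lo, hi))
    return ranges
```

The average width of these ranges is the "average modform range" size measure the abstract refers to; adding BU or TD rows to `A` can only shrink the polyhedron and hence the ranges.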


Author(s):  
D. E. Johnson

Increased specimen penetration, the principal advantage of high-voltage microscopy, is accompanied by an increased need to utilize information on three-dimensional specimen structure available in the form of two-dimensional projections (i.e., micrographs). We are engaged in a program to develop methods which allow the maximum use of the information contained in a through-tilt series of micrographs to determine three-dimensional specimen structure.
In general, we are dealing with structures lacking in symmetry and with projections available from only a limited span of angles (±60°). For these reasons, we must make maximum use of any prior information available about the specimen. To do this in the most efficient manner, we have concentrated on iterative, real-space methods rather than Fourier methods of reconstruction. The particular iterative algorithm we have developed is given in detail in ref. 3. A block diagram of the complete reconstruction system is shown in fig. 1.
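An iterative, real-space reconstruction of this kind can be illustrated with the classic ART/Kaczmarz update; the authors' specific algorithm is in their ref. 3, so this generic sketch only conveys the style of iteration. Each row of `A` encodes one projection measurement through the (flattened) specimen volume, and `b` holds the measured values:

```python
import numpy as np

def kaczmarz(A, b, iters=100, relax=1.0, x0=None):
    """ART / Kaczmarz iteration: cycle through the projection equations,
    correcting the current estimate x so it satisfies each measurement
    in turn. Prior information (e.g. non-negativity) could be enforced
    by projecting x after each sweep."""
    x = np.zeros(A.shape[1]) if x0 is None else x0.astype(float).copy()
    for _ in range(iters):
        for a, bi in zip(A, b):
            x += relax * (bi - a @ x) / (a @ a) * a   # project onto row constraint
    return x
```

Because the update works row by row in real space, prior knowledge about the specimen can be injected between sweeps, which is exactly the flexibility the abstract cites as the reason for preferring iterative methods over Fourier reconstruction.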

