A Novel THz Differential Spectral Clustering Recognition Method Based on t-SNE

2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Tie-Jun Li ◽  
Chih-Cheng Chen ◽  
Jian-jun Liu ◽  
Gui-fang Shao ◽  
Christopher Chun Ki Chan

We apply terahertz (THz) time-domain spectroscopy imaging to the nondestructive inspection of three industrial ceramic matrix composite (CMC) samples and one silicon slice with defects. A low-resolution THz spectrum image makes recognition of sample defect features ineffective. To address this, we propose a spectrum clustering recognition model based on t-distributed stochastic neighbor embedding (t-SNE). First, the model recognizes reduced-dimensional clusters of the different spectra drawn from the imaging spectrum data sets, in order to judge in a low-dimensional space whether a sample includes a feature indicating a defect. Second, we improve computational efficiency by mapping the spectrum data samples from high-dimensional to low-dimensional space with a manifold learning algorithm (t-SNE). Finally, to make sample features visible in the low-dimensional space, we use a conditional probability distribution to measure the distance-invariant similarity. Comparative experiments indicate that our model can judge whether sample defect features exist through spectrum clustering, as a predetection step for image analysis.
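
The conditional-probability similarity mentioned above can be illustrated with a minimal sketch. It assumes the standard t-SNE form, in which the similarity of point j given point i is a Gaussian of their squared distance normalized over all other points. The fixed bandwidth `sigma`, the function name, and the toy `spectra` values are assumptions for illustration; t-SNE itself tunes a per-point bandwidth from a perplexity setting, and the abstract does not give the authors' settings.

```python
import math

def conditional_probabilities(points, sigma=1.0):
    """Return P where P[i][j] approximates p(j|i), a Gaussian similarity.

    Simplification: one shared sigma instead of the per-point bandwidth
    that t-SNE derives from a target perplexity.
    """
    n = len(points)
    P = []
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                weights.append(0.0)  # a point is not its own neighbor
            else:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
        total = sum(weights)
        P.append([w / total for w in weights])  # normalize each row
    return P

# Hypothetical toy "spectra": two similar samples and one outlier.
spectra = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [5.0, 5.1, 4.9]]
P = conditional_probabilities(spectra)
```

Each row of `P` sums to 1, and nearby spectra receive almost all of the probability mass, which is what lets clusters separate visibly in the low-dimensional map.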

2019 ◽  
Vol 15 (3) ◽  
pp. 346-358
Author(s):  
Luciano Barbosa

Purpose: Matching instances of the same entity, a task known as entity resolution, is a key step in data integration. This paper proposes a deep learning network that learns different representations of Web entities for entity resolution. Design/methodology/approach: To match Web entities, the proposed network learns the following representations: embeddings, which are vector representations of the words in the entities in a low-dimensional space; convolutional vectors from a convolutional layer, which capture short-distance patterns in the entities' word sequences; and bag-of-words vectors, created by a bag-of-words (BoW) layer that learns weights for words in the vocabulary based on the task at hand. Given a pair of entities, the similarity between their learned representations is used as a feature for a binary classifier that identifies a possible match. In addition to those features, the classifier also uses a modification of inverse document frequency for pairs, which identifies discriminative words in pairs of entities. Findings: The proposed approach was evaluated on two commercial and two academic entity resolution benchmark data sets. The results show that the proposed strategy outperforms previous approaches on the commercial data sets, which are more challenging, and achieves results similar to those of its competitors on the academic data sets. Originality/value: No previous work has used a single deep learning framework to learn different representations of Web entities for entity resolution.


2013 ◽  
Vol 645 ◽  
pp. 192-195 ◽  
Author(s):  
Xiao Zhou Chen

Dimension reduction is important for understanding microarray data. In this study, we propose an efficient approach to the dimensionality reduction of microarray data. Our method applies a manifold learning algorithm to reduce the dimensionality of microarray data. The intra-/inter-category distances are used as criteria to quantitatively evaluate the effect of the reduction. Colon cancer and leukaemia gene expression data sets are selected for our investigation. When the neighborhood parameter is set effectively, the intrinsic dimensions of all data sets are low. Manifold learning can therefore be used to study microarray data in a low-dimensional projection space. Our results indicate that the manifold learning method performs better than linear methods in the analysis of microarray data, which makes it suitable for clinical diagnosis and other medical applications.
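
One plausible reading of the intra-/inter-category distance criterion is sketched below: average the pairwise distances between embedded points that share a category, and separately those that do not. The abstract does not state the exact formula, so the function name, the use of plain means, and the toy data are assumptions.

```python
import math
from itertools import combinations

def mean_intra_inter(points, labels):
    """Return (mean same-category distance, mean cross-category distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        (intra if lp == lq else inter).append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical 2-D embedding with two well-separated categories.
embedded = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
categories = ["a", "a", "b", "b"]
intra, inter = mean_intra_inter(embedded, categories)
```

Under this reading, a good reduction gives a small intra-category mean relative to the inter-category mean.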


2017 ◽  
Vol 29 (4) ◽  
pp. 1053-1102 ◽  
Author(s):  
Hossein Soleimani ◽  
David J. Miller

Many classification tasks require both labeling objects and determining label associations for parts of each object. Example applications include labeling segments of images or determining relevant parts of a text document when the training labels are available only at the image or document level. This task is usually referred to as multi-instance (MI) learning, where the learner typically receives a collection of labeled (or sometimes unlabeled) bags, each containing several segments (instances). We propose a semisupervised MI learning method for multilabel classification. Most MI learning methods treat instances in each bag as independent and identically distributed samples. However, in many practical applications, instances are related to each other and should not be considered independent. Our model discovers a latent low-dimensional space that captures structure within each bag. Further, unlike many other MI learning methods, which are primarily developed for binary classification, we model multiple classes jointly, thus also capturing possible dependencies between different classes. We develop our model within a semisupervised framework, which leverages both labeled and, typically, a larger set of unlabeled bags for training. We develop several efficient inference methods for our model. We first introduce a Markov chain Monte Carlo method for inference, which can handle arbitrary relations between bag labels and instance labels, including the standard hard-max MI assumption. We also develop an extension of our model that uses stochastic variational Bayes methods for inference, and thus scales better to massive data sets. Experiments show that our approach outperforms several MI learning and standard classification methods on both bag-level and instance-level label prediction. All code for replicating our experiments is available from https://github.com/hsoleimani/MLTM .
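
As a small illustration of the hard-max MI assumption named above, the sketch below marks a bag as carrying a class exactly when at least one of its instances carries it. This is only the standard assumption in isolation, not the authors' latent-space model; the function name and example data are hypothetical.

```python
def bag_labels_hard_max(instance_labels, num_classes):
    """Bag-level multilabel vector: per-class maximum over the instances."""
    return [
        max((inst[c] for inst in instance_labels), default=0)
        for c in range(num_classes)
    ]

# Hypothetical bag of three instances over three classes (0/1 indicators).
bag = [[0, 1, 0], [0, 0, 0], [1, 1, 0]]
labels = bag_labels_hard_max(bag, num_classes=3)
```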


2003 ◽  
Vol 2 (1) ◽  
pp. 68-77 ◽  
Author(s):  
Alistair Morrison ◽  
Greg Ross ◽  
Matthew Chalmers

The term ‘proximity data’ refers to data sets within which it is possible to assess the similarity of pairs of objects. Multidimensional scaling (MDS) is applied to such data and attempts to map high-dimensional objects onto low-dimensional space through the preservation of these similarity relations. Standard MDS techniques have in the past suffered from high computational complexity and, as such, could not feasibly be applied to data sets over a few thousand objects in size. Through a novel hybrid approach based upon stochastic sampling, interpolation and spring models, we have designed an algorithm running in O(N√N). Using Chalmers’ 1996 O(N²) spring model as a benchmark for the evaluation of our technique, we compare layout quality and run times using sets of synthetic and real data. Our algorithm executes significantly faster than Chalmers’ 1996 algorithm, while producing superior layouts. In reducing complexity and run time, we allow the visualisation of data sets of previously infeasible size. Our results indicate that our method is a solid foundation for interactive and visual exploration of data.


2021 ◽  
Author(s):  
Stefan Canzar ◽  
Van Hoan Do ◽  
Slobodan Jelic ◽  
Soeren Laue ◽  
Domagoj Matijevic ◽  
...  

Metric multidimensional scaling is one of the classical methods for embedding data into low-dimensional Euclidean space. It creates the low-dimensional embedding by approximately preserving the pairwise distances between the input points. However, current state-of-the-art approaches only scale to a few thousand data points. For larger data sets such as those occurring in single-cell RNA sequencing experiments, the running time becomes prohibitively large and thus alternative methods such as PCA are widely used instead. Here, we propose a neural network based approach for solving the metric multidimensional scaling problem that is orders of magnitude faster than previous state-of-the-art approaches, and hence scales to data sets with up to a few million cells. At the same time, it provides a non-linear mapping between high- and low-dimensional space that can place previously unseen cells in the same embedding.


Author(s):  
Lukas Miklautz ◽  
Lena G. M. Bauer ◽  
Dominik Mautz ◽  
Sebastian Tschiatschek ◽  
Christian Böhm ◽  
...  

Deep clustering techniques combine representation learning with clustering objectives to improve their performance. Among existing deep clustering techniques, autoencoder-based methods are the most prevalent. While they achieve promising clustering results, they suffer from an inherent conflict between preserving details, as expressed by the reconstruction loss, and finding similar groups by ignoring details, as expressed by the clustering loss. This conflict leads to brittle training procedures, dependence on trade-off hyperparameters and less interpretable results. We propose a framework, ACe/DeC, that is compatible with Autoencoder Centroid-based Deep Clustering methods and automatically learns a latent representation consisting of two separate spaces. The clustering space captures all cluster-specific information and the shared space explains general variation in the data. This separation resolves the above-mentioned conflict and allows our method to learn both detailed reconstructions and cluster-specific abstractions. We evaluate our framework with extensive experiments to show several benefits: (1) cluster performance – on various data sets we outperform relevant baselines; (2) no hyperparameter tuning – this improved performance is achieved without introducing new clustering-specific hyperparameters; (3) interpretability – isolating the cluster-specific information in a separate space is advantageous for data exploration and interpreting the clustering results; and (4) dimensionality of the embedded space – we automatically learn a low-dimensional space for clustering. Our ACe/DeC framework isolates cluster information and increases stability and interpretability, while improving cluster performance.


2018 ◽  
Vol 2018 ◽  
pp. 1-8 ◽  
Author(s):  
Mingxia Chen ◽  
Jing Wang ◽  
Xueqing Li ◽  
Xiaolong Sun

In recent years, manifold learning methods have been widely used in data classification to tackle the curse of dimensionality, since they can discover the intrinsic low-dimensional structure of high-dimensional data. Given partially labeled data, semi-supervised manifold learning algorithms predict the labels of the unlabeled points while taking the label information into account. However, these algorithms are not robust against noisy points, especially when the labeled data contain noise. In this paper, we propose a framework for robust semi-supervised manifold learning (RSSML) to address this problem. The noise levels of the labeled points are first predicted, and a regularization term is then constructed to reduce the impact of labeled points containing noise. A new robust semi-supervised optimization model is obtained by adding the regularization term to the traditional semi-supervised optimization model. Numerical experiments show the improvement and efficiency of RSSML on noisy data sets.


2006 ◽  
Vol 12 (1) ◽  
pp. 69-75 ◽  
Author(s):  
Antanas Žilinskas ◽  
Julius Žilinskas

Experimental sciences collect large amounts of data. Different techniques are available for eliciting information from data. Frequently, statistical analysis should be combined with the experience and intuition of researchers. Human heuristic abilities are developed for, and oriented to, patterns in spaces of dimensionality up to 3. Multidimensional scaling (MDS) addresses the problem of how objects represented by proximity data can be represented by points in a low-dimensional space. MDS methods are implemented as the optimization of a stress function measuring how well the distances between the respective points fit the proximity data. Since the optimization problem is multimodal, a global optimization method should be used. In the present paper, a combination of an evolutionary metaheuristic algorithm with a local search algorithm is used. The experimental results show the influence of the metrics defining distances in the considered spaces on the results of multidimensional scaling. Data sets with known and unknown structure and different dimensionality (up to 512 variables) have been visualized.
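
To make the stress function and the role of the distance metric concrete, here is a minimal sketch. It assumes raw (unweighted) stress, the sum of squared differences between each proximity and the corresponding layout distance, with a Minkowski metric whose exponent `r` selects city-block (`r=1`) or Euclidean (`r=2`) distance. The paper's exact stress definition and optimizer are not given in the abstract, so this shows only the objective, not the evolutionary search.

```python
from itertools import combinations

def minkowski(p, q, r=2.0):
    """Minkowski distance: r=1 is city-block, r=2 is Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

def stress(proximities, layout, r=2.0):
    """Raw stress: sum over pairs of (proximity - layout distance)^2."""
    total = 0.0
    for i, j in combinations(range(len(layout)), 2):
        total += (proximities[i][j] - minkowski(layout[i], layout[j], r)) ** 2
    return total

# Hypothetical proximities taken from a 3-4-5 right triangle.
proximities = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
layout = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
euclidean_stress = stress(proximities, layout, r=2.0)
city_block_stress = stress(proximities, layout, r=1.0)
```

The same layout fits the proximities exactly under the Euclidean metric but not under the city-block metric, which is one way the choice of metric changes the optimization result.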


Author(s):  
Peiyan Li ◽  
Honglian Wang ◽  
Christian Böhm ◽  
Junming Shao

Online semi-supervised multi-label classification is a practical yet challenging task, since only a small number of labeled instances are available in real streaming environments. However, most existing online classification techniques focus on the single-label case, while only a few multi-label stream classification algorithms exist, and they are mainly trained on labeled instances. In this paper, we present a novel Online Semi-supervised Multi-Label learning algorithm (OnSeML) based on label compression and local smooth regression, which allows real-time multi-label predictions in a semi-supervised setting and is robust to evolving label distributions. Specifically, to capture the high-order label relationship and to build a compact target space for regression, OnSeML compresses the label set into a low-dimensional space by a fixed orthogonal label encoder. Then a locally defined regression function for each incoming instance is obtained with a closed-form solution. Targeting the evolving label distribution problem, we propose an adaptive decoding scheme to adequately integrate newly arriving labeled data. Extensive experiments provide empirical evidence for the effectiveness of our approach.
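
The label-compression step can be pictured with a small sketch. It assumes the encoder is a fixed matrix with orthonormal rows, so that a label vector is compressed by a matrix-vector product and approximately recovered with the transpose followed by thresholding. The specific matrix `ENCODER`, the threshold, and the function names are hypothetical, and the adaptive decoding scheme described in the abstract is not reproduced here.

```python
# Hypothetical fixed encoder: 2 orthonormal rows compressing 4 labels to 2 dims.
ENCODER = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]

def encode(labels):
    """Compress a 0/1 label vector into the low-dimensional target space."""
    return [sum(w * y for w, y in zip(row, labels)) for row in ENCODER]

def decode(code, threshold=0.5):
    """Map a code back with the transpose, then threshold to 0/1 labels."""
    num_labels = len(ENCODER[0])
    scores = [
        sum(ENCODER[r][k] * code[r] for r in range(len(ENCODER)))
        for k in range(num_labels)
    ]
    return [1 if s > threshold else 0 for s in scores]

code = encode([1, 0, 1, 0])
recovered = decode(code)
```

Compression of this kind is lossy in general: only label vectors lying in the span of the encoder rows are recovered exactly, as happens for this example.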


2020 ◽  
Author(s):  
Eric Johnson ◽  
William Kath ◽  
Madhav Mani

Single-cell RNA sequencing (scRNA-seq) experiments often measure thousands of genes, making them high-dimensional data sets. As a result, dimensionality reduction (DR) algorithms such as t-SNE and UMAP are necessary for data visualization. However, the use of DR methods in other tasks, such as for cell-type detection or developmental trajectory reconstruction, is stymied by unquantified non-linear and stochastic deformations in the mapping from the high- to low-dimensional space. In this work, we present a statistical framework for the quantification of embedding quality so that DR algorithms can be used with confidence in unsupervised applications. Specifically, this framework generates a local assessment of embedding quality by statistically integrating information across embeddings. Furthermore, the approach separates biological signal from noise via the construction of an empirical null hypothesis. Using this approach on scRNA-seq data reveals biologically relevant structure and suggests a novel “spectral” decomposition of data. We apply the framework to several data sets and DR methods, illustrating its robustness and flexibility as well as its widespread utility in several quantitative applications.

