Clustering Word Embeddings with Self-Organizing Maps. Application on LaRoSeDa - A Large Romanian Sentiment Data Set

Author(s):  
Anca Tache ◽  
Mihaela Gaman ◽  
Radu Tudor Ionescu
Author(s):  
Nazar Elfadil ◽  

Self-organizing maps are unsupervised neural network models that lend themselves to the cluster analysis of high-dimensional input data. Interpreting a trained map is difficult because the features responsible for a specific cluster assignment are not evident from the resulting map representation. This paper presents an approach to automated knowledge acquisition using Kohonen's self-organizing maps and k-means clustering. To demonstrate the architecture and its validation, a data set representing the animal world has been used as the training data set. The produced knowledge base is verified using a conventional expert system.
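As a rough illustration of the kind of map training these abstracts build on, the following is a minimal sketch of a Kohonen-style SOM in plain Python. It is not taken from any of the listed papers: the function names `train_som` and `best_matching_unit`, the linear decay of the learning rate and neighbourhood width, and the sample-based initialisation are all assumptions made for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda k: sum((a - b) ** 2 for a, b in zip(weights[k], x)))


def train_som(data, rows, cols, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # Assumption: initialise every neuron with a copy of a random input sample.
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood width shrinks, floor 0.5
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))  # Gaussian neighbourhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights
```

Calling `train_som(data, 5, 5)` on a list of equal-length numeric vectors yields a 5 by 5 map; `best_matching_unit` then assigns any new vector to a cell.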


Author(s):  
Marjan Vračko ◽  
Subhash C. Basak ◽  
Dwaipayan Sen ◽  
Ashesh Nandy

In this report we consider a data set of 310 Zika virus genome sequences from three continents: Africa, Asia and South America. The sequences, which were compiled from GenBank, were derived from the host cells of different species (Simiiformes, Aedes opok, Aedes africanus, Aedes luteocephalus, Aedes dalzieli, Aedes aegypti, and Homo sapiens). For chemometrical treatment the sequences have been represented by sequence descriptors derived from their graphs or neighborhood matrices. The set was analyzed with three chemometrical methods: Mahalanobis distances, principal component analysis (PCA) and self-organizing maps (SOM). A good separation of samples with respect to the region of origin was observed using these three methods. Background: Study of 310 Zika virus genome sequences from different continents. Objective: To characterize and compare Zika virus sequences from around the world using alignment-free sequence comparison and chemometrical methods. Method: Mahalanobis distance analysis, self-organizing maps and principal component analysis were used to carry out the chemometrical analyses of the Zika sequence data. Results: Genome sequences are clustered with respect to the region of origin (continent, country). Conclusion: African samples are well separated from Asian and South American ones.


1999 ◽  
Vol 09 (03) ◽  
pp. 195-202 ◽  
Author(s):  
JOSÉ ALFREDO FERREIRA COSTA ◽  
MÁRCIO LUIZ DE ANDRADE NETTO

Determining the structure of data without prior knowledge of the number of clusters or any information about their composition is a problem of interest in many fields, such as image analysis, astrophysics and biology. Partitioning a set of n patterns in a p-dimensional feature space must be done such that the patterns in a given cluster are more similar to each other than to the rest. As there are approximately [Formula: see text] possible ways of partitioning the patterns among K clusters, finding the best solution is very hard when n is large. The search space grows further when the number of partitions is not known a priori. Although the self-organizing feature map (SOM) can be used to visualize clusters, automating knowledge discovery with a SOM is a difficult task. This paper proposes region-based image processing methods to post-process the U-matrix obtained after the unsupervised learning performed by the SOM. Mathematical morphology is applied to identify regions of similar neurons. The number of regions and their labels are found automatically, and they are related to the number of clusters in a multivariate data set. New data can be classified by labeling it according to its best-matching neuron. Simulations using data sets drawn from finite mixtures of p-variate normal densities are presented, along with the advantages and drawbacks of the method.
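To make the U-matrix idea more concrete, here is a simplified sketch in plain Python. It computes a U-matrix from a trained map and then labels connected low-distance regions by thresholding and flood fill. This is a deliberately crude stand-in for the mathematical-morphology processing the abstract describes, not a reproduction of it; the names `u_matrix` and `label_regions`, the 4-neighbour definition and the fixed `threshold` are assumptions.

```python
import math
from collections import deque

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def u_matrix(weights, rows, cols):
    """Mean distance from each neuron's weights to its 4-connected grid neighbours."""
    u = {}
    for r in range(rows):
        for c in range(cols):
            dists = [
                math.dist(weights[(r, c)], weights[(r + dr, c + dc)])
                for dr, dc in NEIGHBOURS
                if (r + dr, c + dc) in weights
            ]
            u[(r, c)] = sum(dists) / len(dists) if dists else 0.0
    return u


def label_regions(u, threshold):
    """Flood-fill 4-connected cells whose U-value is below the threshold.

    Returns (labels, n_regions); cells at or above the threshold keep label 0.
    """
    labels = {cell: 0 for cell in u}
    n_regions = 0
    for start in sorted(u):
        if u[start] >= threshold or labels[start]:
            continue
        n_regions += 1
        labels[start] = n_regions
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            for dr, dc in NEIGHBOURS:
                nxt = (r + dr, c + dc)
                if nxt in u and u[nxt] < threshold and not labels[nxt]:
                    labels[nxt] = n_regions
                    queue.append(nxt)
    return labels, n_regions
```

The number of regions returned is only as good as the chosen threshold; morphological operators are one way to avoid hand-picking that value.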


2001 ◽  
Vol 23 (1) ◽  
pp. 29-37 ◽  
Author(s):  
Torsten Mattfeldt ◽  
Hubertus Wolter ◽  
Ralf Kemmerling ◽  
Hans‐Werner Gottfried ◽  
Hans A. Kestler

Comparative genomic hybridization (CGH) is a modern genetic method which enables a genome‐wide survey of chromosomal imbalances. For each chromosome region, one learns whether there is a loss or gain of genetic material, or no change at that region. Usually it is not possible to evaluate all 46 chromosomes of a metaphase, so several (up to 20 or more) metaphases are analyzed per individual and expressed as an average. In most cases one studies not a single individual but groups of 20–30 individuals. Large amounts of data therefore accumulate quickly and must be put into a logical order. In this paper we present the application of a self‐organizing map (Genecluster) as a tool for cluster analysis of data from pT2N0 prostate cancer cases studied by CGH. Self‐organizing maps are artificial neural networks capable of forming clusters on the basis of an unsupervised learning rule; in our examples the network receives the CGH data as its only input (no clinical data). We studied a group of 40 recent cases without follow‐up, an older group of 20 cases with follow‐up, and the data set obtained by pooling both groups. In all groups good clusterings were found, in the sense that clinically similar cases were placed into the same clusters on the basis of the genetic information alone. The data indicate that losses on chromosome arms 6q, 8p and 13q are all frequent in pT2N0 prostatic cancer, but the loss on 8p probably has the greatest prognostic importance.


2011 ◽  
Vol 16 (4) ◽  
pp. 488-504 ◽  
Author(s):  
Pavel Stefanovič ◽  
Olga Kurasova

This article investigates additional ways of visualizing self-organizing maps (SOM). The main objective of self-organizing maps is to cluster data and present the result graphically. The visualization capabilities of four SOM systems (NeNet, SOM-Toolbox, Databionic ESOM and Viscovery SOMine) have been investigated; each system offers its own additional visualization tools. A comparative analysis has been made for two data sets: Fisher’s iris data set and the economic indices of the European Union countries. A new SOM system is also introduced and studied. It has a specific visualization tool that is missing in the other SOM systems: it shows, for each SOM cell, the proportions of the data items from different classes that fall into that cell.
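The per-cell class proportions mentioned in the abstract can be computed with a few lines of Python once each data item has been assigned to a map cell. The sketch below is only illustrative: the function name `class_proportions_per_cell` and the input layout (one cell coordinate and one class label per data item) are assumptions, and how the proportions are drawn is left open.

```python
from collections import Counter, defaultdict


def class_proportions_per_cell(cells, labels):
    """For each SOM cell, return the share of its data items belonging to each class.

    cells  -- one (row, col) cell per data item, e.g. its best-matching unit
    labels -- the class label of each data item, in the same order
    """
    counts = defaultdict(Counter)
    for cell, label in zip(cells, labels):
        counts[cell][label] += 1
    return {
        cell: {label: n / sum(counter.values()) for label, n in counter.items()}
        for cell, counter in counts.items()
    }
```

A cell dominated by one class gets a proportion close to 1.0 for that class, while mixed cells show where classes overlap on the map.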


2015 ◽  
Vol 5 (1) ◽  
pp. 1-12
Author(s):  
Chris Gorman ◽  
Clint Rogers ◽  
Iren Valova

Self-organizing maps are extremely useful in the field of pattern recognition. They become less useful, however, when neurons fail to activate during training. This phenomenon occurs when neurons are initialized in areas of non-input and are far enough away from the input data to never move toward the input. These neurons effectively misrepresent the data set. This results in, among other things, patterns becoming unrecognizable. We introduce an algorithm called No Neuron Left Behind (NNLB) to solve this problem. We show that our algorithm produces a more accurate topological representation of the input space. We also show that no neuron clusters form in areas of non-input and that the mapping quality of the SOM increases drastically when our algorithm is implemented. Finally, the running time of NNLB is better than or comparable to that of the classic SOM without it.
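The problem described here can at least be diagnosed cheaply: count how often each neuron wins as best-matching unit and list those that never do. The sketch below does only that. It is not the NNLB algorithm, whose details the abstract does not give; the names `bmu_hit_counts` and `dead_neurons` are assumptions.

```python
def bmu_hit_counts(weights, data):
    """Count how many data items select each neuron as best-matching unit."""
    hits = {cell: 0 for cell in weights}
    for x in data:
        best = min(
            weights,
            key=lambda cell: sum((a - b) ** 2 for a, b in zip(weights[cell], x)),
        )
        hits[best] += 1
    return hits


def dead_neurons(weights, data):
    """Return the grid positions of neurons that never win for any data item."""
    return sorted(cell for cell, n in bmu_hit_counts(weights, data).items() if n == 0)
```

A long list of dead neurons after training suggests that part of the map sits in regions of non-input.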


2008 ◽  
Vol 12 (2) ◽  
pp. 657-667 ◽  
Author(s):  
M. Herbst ◽  
M. C. Casper

Aggregating statistical performance measures discard a very large share of the information contained in model time series, compared to the amount of information one would like to draw from them for model identification and calibration. It has been shown that this loss imposes important limitations on model identification and diagnostics and thus constitutes an element of the overall model uncertainty. In this contribution we present an approach using a Self-Organizing Map (SOM) to circumvent the identifiability problem induced by the low discriminatory power of aggregating performance measures. Instead, a Self-Organizing Map is used to differentiate the spectrum of model realizations, obtained from Monte-Carlo simulations with a distributed conceptual watershed model, based on the recognition of different patterns in time series. Further, the SOM is used instead of a classical optimization algorithm to identify those model realizations among the Monte-Carlo simulation results that most closely approximate the pattern of the measured discharge time series. The results are analyzed and compared with the manually calibrated model as well as with the results of the Shuffled Complex Evolution algorithm (SCE-UA). In our study the latter slightly outperformed the SOM results. The SOM method, however, yields a set of equivalent model parameterizations and therefore also allows the parameter space to be confined to a region that closely represents a measured data set. This particular feature renders the SOM potentially useful for future model identification applications.


2017 ◽  
Vol 5 (2) ◽  
pp. T163-T171 ◽  
Author(s):  
Tao Zhao ◽  
Fangyu Li ◽  
Kurt J. Marfurt

Pattern recognition-based seismic facies analysis techniques are commonly used in modern quantitative seismic interpretation. However, interpreters often treat techniques such as artificial neural networks and self-organizing maps (SOMs) as a “black box” that somehow correlates a suite of attributes to a desired geomorphological or geomechanical facies. Even when the statistical correlations are good, the inability to explain such correlations through principles of geology or physics results in suspicion of the results. The most common multiattribute facies analysis begins by correlating a suite of candidate attributes to a desired output, keeping those that correlate best for subsequent analysis. The analysis then takes place in attribute space rather than ([Formula: see text], [Formula: see text], and [Formula: see text]) space, removing spatial trends often observed by interpreters. We add a stratigraphic layering component to a SOM model that attempts to preserve the intersample relation along the vertical axis. Specifically, we use a mode decomposition algorithm to capture the sedimentary cycle pattern as an “attribute.” If we correlate this attribute to the training data, it will favor SOM facies maps that follow stratigraphy. We apply this workflow to a Barnett Shale data set and find that the constrained SOM facies map shows layers that are easily overlooked on a traditional unconstrained SOM facies map.


Author(s):  
Mikko Heikkinen ◽  
Ville Nurminen ◽  
Yrjö Hiltunen

Self-organizing maps (SOM) have been successfully applied in many fields of research. In this paper, we demonstrate the use of a SOM-based method for the analysis of an expandable polystyrene (EPS) batch process. To this end, a data set from the EPS batch process was used to train a SOM. The reference vectors of the SOM were then classified by the K-means algorithm into six clusters, which represent product types of the process. The SOM could also be used to estimate the optimal amounts of the stabilisation agent. The results on a validation data set showed good agreement between the actual and estimated amounts of the stabilisation agent. Based on this model, a Web application was built for test use at the plant. The results indicate that the SOM method can also be applied efficiently to the analysis of the batch process.
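The second stage described above, grouping the SOM reference vectors with K-means, might be sketched as follows in plain Python. This is a generic Lloyd-style K-means and not the authors' implementation; the function name `kmeans`, the deterministic farthest-point initialisation and the iteration cap are assumptions chosen to keep the example reproducible.

```python
def kmeans(vectors, k, max_iter=100):
    """Cluster vectors (e.g. SOM reference vectors) into k groups.

    Returns (labels, centroids), with labels[i] the cluster index of vectors[i].
    """
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Assumption: farthest-point initialisation, starting from the first vector.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(d2(v, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: d2(v, centroids[j])) for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Passing the list of a trained map's reference vectors with `k=6` would mirror the six clusters mentioned in the abstract; each map cell then inherits the cluster label of its reference vector.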

