Clustering of Zika viruses originating from different geographical regions using computational sequence descriptors

Author(s):  
Marjan Vračko ◽  
Subhash C. Basak ◽  
Dwaipayan Sen ◽  
Ashesh Nandy

In this report we consider a data set of 310 Zika virus genome sequences taken from different continents: Africa, Asia and South America. The sequences, which were compiled from GenBank, were derived from the cells of different host species (Simiiformes, Aedes opok, Aedes africanus, Aedes luteocephalus, Aedes dalzieli, Aedes aegypti, and Homo sapiens). For chemometrical treatment the sequences have been represented by sequence descriptors derived from their graphs or neighborhood matrices. The set was analyzed with three chemometrical methods: Mahalanobis distances, principal component analysis (PCA) and self-organizing maps (SOM). A good separation of samples with respect to the region of origin was observed using these three methods. Background: Study of 310 Zika virus genome sequences from different continents. Objective: To characterize and compare Zika virus sequences from around the world using alignment-free sequence comparison and chemometrical methods. Method: Mahalanobis distance analysis, self-organizing maps and principal component analysis were used to carry out the chemometrical analyses of the Zika sequence data. Results: Genome sequences are clustered with respect to the region of origin (continent, country). Conclusion: African samples are well separated from Asian and South American ones.
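The Mahalanobis distance mentioned above measures how far a sample lies from a group mean, scaled by the group's covariance. The following is a minimal sketch, not the authors' implementation: it assumes each sequence is reduced to just two hypothetical descriptor values so that the 2x2 covariance matrix can be inverted in closed form, and all function names and numbers are illustrative.

```python
from math import sqrt


def mean_vector(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def covariance_2d(points):
    """Sample covariance matrix (n - 1 denominator) of 2-D points."""
    n = len(points)
    mx, my = mean_vector(points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def mahalanobis_2d(point, group):
    """Mahalanobis distance of `point` from the mean of `group`."""
    mx, my = mean_vector(group)
    (a, b), (_, d) = covariance_2d(group)
    det = a * d - b * b
    if det <= 0:
        raise ValueError("covariance matrix is singular; cannot invert")
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = diff^T S^-1 diff, with S^-1 = (1 / det) * [[d, -b], [-b, a]]
    squared = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return sqrt(squared)


# Hypothetical descriptor pairs for one group of sequences (assumed values).
group = [(0.0, 0.0), (1.0, 1.2), (2.0, 1.8), (3.0, 3.0)]
along = mahalanobis_2d((2.5, 2.5), group)   # lies along the group's trend
across = mahalanobis_2d((2.5, 0.5), group)  # same Euclidean offset, off-trend
```

Both test points are equally far from the group mean in Euclidean terms, but the off-trend point gets a much larger Mahalanobis distance, which is the property that makes the measure useful for separating groups with correlated descriptors.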

Author(s):  
Nazar Elfadil ◽  

Self-organizing maps are unsupervised neural network models that lend themselves to the cluster analysis of high-dimensional input data. Interpreting a trained map is difficult because the features responsible for a specific cluster assignment are not evident from the resulting map representation. This paper presents an approach to automated knowledge acquisition using Kohonen's self-organizing maps and k-means clustering. To demonstrate the architecture and its validation, a data set representing the animal world has been used as the training data set. The produced knowledge base is verified using a conventional expert system.
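For readers unfamiliar with how a self-organizing map learns, the core online update can be sketched in a few lines. This is a generic illustration under stated assumptions (a 1-D chain of neurons, a Gaussian neighborhood, linearly decaying learning rate and radius), not the knowledge-acquisition pipeline of the paper; every name and parameter value below is hypothetical.

```python
from math import exp


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    dists = [sum((w_i - x_i) ** 2 for w_i, x_i in zip(w, x)) for w in weights]
    return dists.index(min(dists))


def som_update(weights, x, lr, sigma):
    """One online SOM update on a 1-D chain of neurons; returns new weights."""
    bmu = best_matching_unit(weights, x)
    updated = []
    for j, w in enumerate(weights):
        # Gaussian neighborhood: neurons near the BMU on the chain move more.
        h = exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
        updated.append([w_i + lr * h * (x_i - w_i) for w_i, x_i in zip(w, x)])
    return updated


def train_som(data, weights, epochs=20, lr0=0.5, sigma0=1.0):
    """Repeat the update over the data with linearly decaying lr and sigma."""
    for epoch in range(epochs):
        frac = 1 - epoch / epochs
        lr = lr0 * frac
        sigma = max(sigma0 * frac, 0.1)  # keep the radius strictly positive
        for x in data:
            weights = som_update(weights, x, lr, sigma)
    return weights


# Assumed toy data: two small groups of 2-D inputs and three neurons.
data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
trained = train_som(data, [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```

After training, each input can be assigned to its best matching unit with `best_matching_unit(trained, x)`; a second-stage clustering such as k-means would then operate on the trained weight vectors rather than on the raw data.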


1999 ◽  
Vol 09 (03) ◽  
pp. 195-202 ◽  
Author(s):  
JOSÉ ALFREDO FERREIRA COSTA ◽  
MÁRCIO LUIZ DE ANDRADE NETTO

Determining the structure of data without prior knowledge of the number of clusters or any information about their composition is a problem of interest in many fields, such as image analysis, astrophysics and biology. Partitioning a set of n patterns in a p-dimensional feature space must be done such that the patterns in a given cluster are more similar to each other than to the rest. As there are approximately [Formula: see text] possible ways of partitioning the patterns among K clusters, finding the best solution is very hard when n is large. The search space grows further when the number of partitions is not known a priori. Although the self-organizing feature map (SOM) can be used to visualize clusters, the automation of knowledge discovery by SOM is a difficult task. This paper proposes region-based image processing methods to post-process the U-matrix obtained after the unsupervised learning performed by SOM. Mathematical morphology is applied to identify regions of neurons that are similar. The number of regions and their labels are found automatically, and they are related to the number of clusters in a multivariate data set. New data can be classified by labeling them according to the best-matching neuron. Simulations using data sets drawn from finite mixtures of p-variate normal densities are presented, as well as related advantages and drawbacks of the method.
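To give a feel for how quickly the number of possible partitions grows, the exact count of ways to split n labeled patterns into K non-empty clusters is the Stirling number of the second kind. The sketch below computes it with the standard recurrence S(n, k) = k·S(n−1, k) + S(n−1, k−1); it is offered as an illustration only and does not reproduce the approximate formula that is elided in the abstract above. The function name is hypothetical.

```python
def count_partitions(n, k):
    """Stirling number of the second kind S(n, k): the number of ways to
    split n labeled patterns into k non-empty, unlabeled clusters."""
    if k < 0 or k > n:
        return 0
    # table[i][j] holds S(i, j); S(0, 0) = 1 is the only non-zero base case.
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            # Pattern i either joins one of j existing clusters
            # or starts a new cluster on its own.
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]


small = count_partitions(10, 3)   # 9330
larger = count_partitions(50, 4)  # already astronomically large
```

Even ten patterns and three clusters give 9330 candidate partitions, which is why exhaustive search is impractical for realistic n.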


2001 ◽  
Vol 23 (1) ◽  
pp. 29-37 ◽  
Author(s):  
Torsten Mattfeldt ◽  
Hubertus Wolter ◽  
Ralf Kemmerling ◽  
Hans‐Werner Gottfried ◽  
Hans A. Kestler

Comparative genomic hybridization (CGH) is a modern genetic method which enables a genome‐wide survey of chromosomal imbalances. For each chromosome region, one obtains the information whether there is a loss or gain of genetic material, or whether there is no change at that region. It is usually not possible to evaluate all 46 chromosomes of a metaphase; therefore, several (up to 20 or more) metaphases are analyzed per individual and expressed as an average. In most cases one does not study a single individual but groups of 20–30 individuals. Large amounts of data therefore accumulate quickly and must be put into a logical order. In this paper we present the application of a self‐organizing map (Genecluster) as a tool for cluster analysis of data from pT2N0 prostate cancer cases studied by CGH. Self‐organizing maps are artificial neural networks with the capability to form clusters on the basis of an unsupervised learning rule; in our examples, the network receives the CGH data as its only information (no clinical data). We studied a group of 40 recent cases without follow‐up, an older group of 20 cases with follow‐up, and the data set obtained by pooling both groups. In all groups good clusterings were found, in the sense that clinically similar cases were placed into the same clusters on the basis of the genetic information only. The data indicate that losses on chromosome arms 6q, 8p and 13q are all frequent in pT2N0 prostatic cancer, but that the loss on 8p probably has the largest prognostic importance.


2011 ◽  
Vol 16 (4) ◽  
pp. 488-504 ◽  
Author(s):  
Pavel Stefanovič ◽  
Olga Kurasova

In the article, additional visualization of self-organizing maps (SOM) has been investigated. The main objective of self-organizing maps is data clustering and the graphical presentation of the clusters. Opportunities for SOM visualization in four systems (NeNet, SOM-Toolbox, Databionic ESOM and Viscovery SOMine) have been investigated. Each system has its own additional tools for visualizing SOM. A comparative analysis has been made for two data sets: Fisher’s iris data set and the economic indices of the European Union countries. A new SOM system is also introduced and investigated. The system has a specific visualization tool that is missing in other SOM systems. It helps to see the proportion of neurons corresponding to data items that belong to different classes and fall in the same SOM cell.
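The per-cell class proportions described above can be illustrated with a short sketch. It assumes that every data item has already been mapped to a SOM cell (given here as a grid coordinate) and carries a class label; it is one plausible reading of the visualization idea, not the tool from the article, and the function name and sample values are hypothetical.

```python
from collections import Counter, defaultdict


def class_proportions_per_cell(cells, labels):
    """For each SOM cell, the fraction of its data items in each class."""
    counts = defaultdict(Counter)
    for cell, label in zip(cells, labels):
        counts[cell][label] += 1
    return {
        cell: {label: c / sum(counter.values()) for label, c in counter.items()}
        for cell, counter in counts.items()
    }


# Assumed example: four items mapped onto two cells of a SOM grid.
cells = [(0, 0), (0, 0), (0, 0), (1, 2)]
labels = ["setosa", "setosa", "versicolor", "virginica"]
proportions = class_proportions_per_cell(cells, labels)
```

A cell whose proportions are dominated by a single class indicates a clean cluster, whereas mixed cells point to regions where classes overlap on the map.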


A nuanced analysis of the spatial and temporal distribution of supercell tornadoes and the characteristics of the near-storm environments associated with those tornadoes is critical to improving our understanding of the range of environments that can be considered tornado-favorable. This work classifies both supercell tornado probabilities and their associated environmental parameters on hourly and daily time scales based on geographical regions: regional probability of tornado events and the probability of deviation above or below the median tornadic near-storm environmental parameter values are estimated by kernel density estimation and classified by self-organizing maps (SOMs). The SOM classification for tornado probability allows for further examination of the deviation of the environmental parameters from the median for each probability cluster. Regions that have similar tornado probabilities but differ in the deviation of the environmental parameters (“parameter anomalies”) are also highlighted using SOMs. The anomaly patterns for different regions and parameters generally evolve along either seasonal or diurnal scales, but rarely both, highlighting the need for flexible models of tornado potential based on the near-storm environment. The spatial and temporal variability of parameter anomalies adds complexity to traditional forecasting approaches that depend upon a fixed set of environmental parameter thresholds. This work highlights the need to develop region-specific and potentially time-specific environmental baseline evaluation to improve forecast and warning skill.


2015 ◽  
Vol 5 (1) ◽  
pp. 1-12
Author(s):  
Chris Gorman ◽  
Clint Rogers ◽  
Iren Valova

Self-organizing maps are extremely useful in the field of pattern recognition. They become less useful, however, when neurons fail to activate during training. This phenomenon occurs when neurons are initialized in areas of non-input and are far enough away from the input data to never move toward the input. These neurons effectively misrepresent the data set. This results in, among other things, patterns becoming unrecognizable. We introduce an algorithm called No Neuron Left Behind (NNLB) to solve this problem. We show that our algorithm produces a more accurate topological representation of the input space. We also show that no neuron clusters form in areas of non-input and that the mapping quality of the SOM increases drastically when our algorithm is implemented. Finally, the running time of NNLB is better than or comparable to that of the classic SOM without it.
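The problem of neurons that never activate can be made concrete with a small diagnostic. The sketch below only detects such neurons by checking which ones are never selected as the best matching unit for any input; it is not the NNLB algorithm, whose details are not given in the abstract, and the function name and sample values are hypothetical.

```python
def find_inactive_neurons(weights, data):
    """Indices of neurons that are never the best matching unit for any input."""
    hit = set()
    for x in data:
        dists = [
            sum((w_i - x_i) ** 2 for w_i, x_i in zip(w, x)) for w in weights
        ]
        hit.add(dists.index(min(dists)))
    return [j for j in range(len(weights)) if j not in hit]


# Assumed example: the third neuron sits far away from all inputs.
weights = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
data = [[0.0, 0.1], [0.9, 1.0], [0.2, 0.0]]
inactive = find_inactive_neurons(weights, data)  # [2]
```

A non-empty result signals that part of the map lies in a region without input data, which is the situation the abstract describes as misrepresenting the data set.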


2008 ◽  
Vol 12 (2) ◽  
pp. 657-667 ◽  
Author(s):  
M. Herbst ◽  
M. C. Casper

The reduction of information contained in model time series through the use of aggregating statistical performance measures is very high compared to the amount of information that one would like to draw from it for model identification and calibration purposes. It has been readily shown that this loss imposes important limitations on model identification and diagnostics and thus constitutes an element of the overall model uncertainty. In this contribution we present an approach using a Self-Organizing Map (SOM) to circumvent the identifiability problem induced by the low discriminatory power of aggregating performance measures. Instead, a Self-Organizing Map is used to differentiate the spectrum of model realizations, obtained from Monte-Carlo simulations with a distributed conceptual watershed model, based on the recognition of different patterns in time series. Further, the SOM is used instead of a classical optimization algorithm to identify those model realizations among the Monte-Carlo simulation results that most closely approximate the pattern of the measured discharge time series. The results are analyzed and compared with the manually calibrated model as well as with the results of the Shuffled Complex Evolution algorithm (SCE-UA). In our study the latter slightly outperformed the SOM results. The SOM method, however, yields a set of equivalent model parameterizations and therefore also allows for confining the parameter space to a region that closely represents a measured data set. This particular feature renders the SOM potentially useful for future model identification applications.


2017 ◽  
Vol 5 (2) ◽  
pp. T163-T171 ◽  
Author(s):  
Tao Zhao ◽  
Fangyu Li ◽  
Kurt J. Marfurt

Pattern recognition-based seismic facies analysis techniques are commonly used in modern quantitative seismic interpretation. However, interpreters often treat techniques such as artificial neural networks and self-organizing maps (SOMs) as a “black box” that somehow correlates a suite of attributes to a desired geomorphological or geomechanical facies. Even when the statistical correlations are good, the inability to explain such correlations through principles of geology or physics results in suspicion of the results. The most common multiattribute facies analysis begins by correlating a suite of candidate attributes to a desired output, keeping those that correlate best for subsequent analysis. The analysis then takes place in attribute space rather than ([Formula: see text], [Formula: see text], and [Formula: see text]) space, removing spatial trends often observed by interpreters. We add a stratigraphic layering component to a SOM model that attempts to preserve the intersample relation along the vertical axis. Specifically, we use a mode decomposition algorithm to capture the sedimentary cycle pattern as an “attribute.” If we correlate this attribute to the training data, it will favor SOM facies maps that follow stratigraphy. We apply this workflow to a Barnett Shale data set and find that the constrained SOM facies map shows layers that are easily overlooked on a traditional unconstrained SOM facies map.

