A PROBABILISTIC SELF-ORGANIZING MAP FOR BINARY DATA TOPOGRAPHIC CLUSTERING

Author(s):  
MUSTAPHA LEBBAH ◽  
YOUNÈS BENNANI ◽  
NICOLETA ROGOVSCHI

This paper introduces a probabilistic self-organizing map for topographic clustering, analysis and visualization of multivariate binary data or categorical data using binary coding. We propose a probabilistic formalism dedicated to binary data in which cells are represented by a Bernoulli distribution. Each cell is characterized by a prototype with the same binary coding as used in the data space and the probability of being different from this prototype. The learning algorithm, Bernoulli on self-organizing map, that we propose is an application of the EM standard algorithm. We illustrate the power of this method with six data sets taken from a public data set repository. The results show a good quality of the topological ordering and homogenous clustering.

2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handle data sets that contain either continuous or categorical variables. However data sets with mixed types of variables are commonly used in data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis and visualization mixed data (continuous/binary). The learning of weights and prototypes is done in a simultaneous manner assuring an optimized data clustering. More variables has a high weight, more the clustering algorithm will take into account the informations transmitted by these variables. The learning of these topological maps is combined with a weighting process of different variables by computing weights which influence the quality of clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, Zoo data set and other three mixed data sets. The results show a good quality of the topological ordering and homogenous clustering.


2014 ◽  
pp. 68-75
Author(s):  
Oles Hodych ◽  
Yuriy Shcherbyna ◽  
Michael Zylan

In this article the authors propose an approach to forecasting the direction of the share price fluctuation, which is based on utilization of the Feedforward Neural Network in conjunction with Self-Organizing Map. It is proposed to use the Self-Organizing Map for filtration of the share price data set, whereas the Feedforward Neural Network is used to forecast the direction of the share price fluctuation based on the filtered data set. The comparison results are presented for filtered and non-filtered share price data sets.


Author(s):  
Tushar ◽  
Tushar ◽  
Shibendu Shekhar Roy ◽  
Dilip Kumar Pratihar

Clustering is a potential tool of data mining. A clustering method analyzes the pattern of a data set and groups the data into several clusters based on the similarity among themselves. Clusters may be either crisp or fuzzy in nature. The present chapter deals with clustering of some data sets using Fuzzy C-Means (FCM) algorithm and Entropy-based Fuzzy Clustering (EFC) algorithm. In FCM algorithm, the nature and quality of clusters depend on the pre-defined number of clusters, level of cluster fuzziness and a threshold value utilized for obtaining the number of outliers (if any). On the other hand, the quality of clusters obtained by the EFC algorithm is dependent on a constant used to establish the relationship between the distance and similarity of two data points, a threshold value of similarity and another threshold value used for determining the number of outliers. The clusters should ideally be distinct and at the same time compact in nature. Moreover, the number of outliers should be as minimum as possible. Thus, the above problem may be posed as an optimization problem, which will be solved using a Genetic Algorithm (GA). The best set of multi-dimensional clusters will be mapped into 2-D for visualization using a Self-Organizing Map (SOM).


2003 ◽  
Vol 13 (05) ◽  
pp. 353-365 ◽  
Author(s):  
ZHENG WU ◽  
GARY G. YEN

The Self-Organizing Map (SOM) is an efficient tool for visualizing high-dimensional data. In this paper, an intuitive and effective SOM projection method is proposed for mapping high-dimensional data onto the two-dimensional grid structure with a growing self-organizing mechanism. In the learning phase, a growing SOM is trained and the growing cell structure is used as the baseline framework. In the ordination phase, the new projection method is used to map the input vector so that the input data is mapped to the structure of the SOM without having to plot the weight values, resulting in easy visualization of the data. The projection method is demonstrated on four different data sets, including a 118 patent data set and a 399 checical abstract data set related to polymer cements, with promising results and a significantly reduced network size.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Xiaolan Chen ◽  
Hui Yang ◽  
Guifen Liu ◽  
Yong Zhang

Abstract Background Nucleosome organization is involved in many regulatory activities in various organisms. However, studies integrating nucleosome organization in mammalian genomes are very limited mainly due to the lack of comprehensive data quality control (QC) assessment and uneven data quality of public data sets. Results The NUCOME is a database focused on filtering qualified nucleosome organization referenced landscapes covering various cell types in human and mouse based on QC metrics. The filtering strategy guarantees the quality of nucleosome organization referenced landscapes and exempts users from redundant data set selection and processing. The NUCOME database provides standardized, qualified data source and informative nucleosome organization features at a whole-genome scale and on the level of individual loci. Conclusions The NUCOME provides valuable data resources for integrative analyses focus on nucleosome organization. The NUCOME is freely available at http://compbio-zhanglab.org/NUCOME.


2019 ◽  
Vol 35 (4) ◽  
pp. 373-384
Author(s):  
Cuong Le ◽  
Viet Vu Vu ◽  
Le Thi Kieu Oanh ◽  
Nguyen Thi Hai Yen

Though clustering algorithms have long history, nowadays clustering topic still attracts a lot of attention because of the need of efficient data analysis tools in many applications such as social network, electronic commerce, GIS, etc. Recently, semi-supervised clustering, for example, semi-supervised K-Means, semi-supervised DBSCAN, semi-supervised graph-based clustering (SSGC) etc., which uses side information, has received a great deal of attention. Generally, there are two forms of side information: seed form (labeled data) and constraint form (must-link, cannot-link). By integrating information provided by the user or domain expert, the semi-supervised clustering can produce expected results. In fact, clustering results usually depend on side information provided, so different side information will produce different results of clustering. In some cases, the performance of clustering may decrease if the side information is not carefully chosen. This paper addresses the problem of efficient collection of seeds for semi-supervised clustering, especially for graph based clustering by seeding (SSGC). The properly collected seeds can boost the quality of clustering and minimize the number of queries solicited from the user. For this purpose, we have developed an active learning algorithm (called SKMMM) for the seeds collection task, which identifies candidates to solicit users by using the K-Means and min-max algorithms. Experiments conducted on real data sets from UCI and a real collected document data set show the effectiveness of our approach compared with other methods.


2021 ◽  
Vol 2123 (1) ◽  
pp. 012030
Author(s):  
D Rosadi ◽  
W Andriyani ◽  
D Arisanty ◽  
D Agustina

Abstract Prediction of the occurrences of forest fire has become interest of various research studies for instances, it is found that the hybrid method based on clustering using fuzzy c-means before doing the classification approach will improve the performance of prediction than directly apply the classification approach. In this study, we will consider the new hybrid approach between clustering based on Self Organizing Map (SOM) approach and classification using Boosting (AdaBoost) approach. Our empirical analysis shows using the same public data set, which has been used in several previous studies, the performance of hybrid SOM-AdaBoost will outperforms other methods in literatures.


2005 ◽  
Vol 15 (01n02) ◽  
pp. 101-110 ◽  
Author(s):  
TIMO SIMILÄ ◽  
SAMPSA LAINE

Practical data analysis often encounters data sets with both relevant and useless variables. Supervised variable selection is the task of selecting the relevant variables based on some predefined criterion. We propose a robust method for this task. The user manually selects a set of target variables and trains a Self-Organizing Map with these data. This sets a criterion to variable selection and is an illustrative description of the user's problem, even for multivariate target data. The user also defines another set of variables that are potentially related to the problem. Our method returns a subset of these variables, which best corresponds to the description provided by the Self-Organizing Map and, thus, agrees with the user's understanding about the problem. The method is conceptually simple and, based on experiments, allows an accessible approach to supervised variable selection.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Jiawei Lian ◽  
Junhong He ◽  
Yun Niu ◽  
Tianze Wang

Purpose The current popular image processing technologies based on convolutional neural network have the characteristics of large computation, high storage cost and low accuracy for tiny defect detection, which is contrary to the high real-time and accuracy, limited computing resources and storage required by industrial applications. Therefore, an improved YOLOv4 named as YOLOv4-Defect is proposed aim to solve the above problems. Design/methodology/approach On the one hand, this study performs multi-dimensional compression processing on the feature extraction network of YOLOv4 to simplify the model and improve the feature extraction ability of the model through knowledge distillation. On the other hand, a prediction scale with more detailed receptive field is added to optimize the model structure, which can improve the detection performance for tiny defects. Findings The effectiveness of the method is verified by public data sets NEU-CLS and DAGM 2007, and the steel ingot data set collected in the actual industrial field. The experimental results demonstrated that the proposed YOLOv4-Defect method can greatly improve the recognition efficiency and accuracy and reduce the size and computation consumption of the model. Originality/value This paper proposed an improved YOLOv4 named as YOLOv4-Defect for the detection of surface defect, which is conducive to application in various industrial scenarios with limited storage and computing resources, and meets the requirements of high real-time and precision.


2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Kwang Baek Kim ◽  
Chang Won Kim

Accurate measures of liver fat content are essential for investigating hepatic steatosis. For a noninvasive inexpensive ultrasonographic analysis, it is necessary to validate the quantitative assessment of liver fat content so that fully automated reliable computer-aided software can assist medical practitioners without any operator subjectivity. In this study, we attempt to quantify the hepatorenal index difference between the liver and the kidney with respect to the multiple severity status of hepatic steatosis. In order to do this, a series of carefully designed image processing techniques, including fuzzy stretching and edge tracking, are applied to extract regions of interest. Then, an unsupervised neural learning algorithm, the self-organizing map, is designed to establish characteristic clusters from the image, and the distribution of the hepatorenal index values with respect to the different levels of the fatty liver status is experimentally verified to estimate the differences in the distribution of the hepatorenal index. Such findings will be useful in building reliable computer-aided diagnostic software if combined with a good set of other characteristic feature sets and powerful machine learning classifiers in the future.


Sign in / Sign up

Export Citation Format

Share Document