Spark-GHSOM: Growing Hierarchical Self-Organizing Map for large scale mixed attribute datasets

2019 ◽  
Vol 496 ◽  
pp. 572-591 ◽  
Author(s):  
Ameya Malondkar ◽  
Roberto Corizzo ◽  
Iluju Kiringa ◽  
Michelangelo Ceci ◽  
Nathalie Japkowicz
2017 ◽  
Vol 2017 ◽  
pp. 1-11 ◽  
Author(s):  
Adeoluwa Akande ◽  
Ana Cristina Costa ◽  
Jorge Mateu ◽  
Roberto Henriques

The explosion of data in the information age has provided an opportunity to characterize climate patterns using data mining techniques. Nigeria has a unique tropical climate with two precipitation regimes: low precipitation in the north, leading to aridity and desertification, and high precipitation in parts of the southwest and southeast, leading to large-scale flooding. In this research, four indices have been used to characterize the intensity, frequency, and amount of rainfall over Nigeria. A type of Artificial Neural Network called the self-organizing map has been used to reduce the number of dimensions and produce four unique zones characterizing extreme precipitation conditions in Nigeria. This approach allowed for the assessment of spatial and temporal patterns in extreme precipitation over the last three decades. Precipitation properties in each cluster are discussed. The cluster closest to the Atlantic has high values of precipitation intensity, frequency, and duration, whereas the cluster closest to the Sahara Desert has low values. A significant increasing trend has been observed in the frequency of rainy days at the center of the northern region of Nigeria.


Author(s):  
Kanta Tachibana ◽  
Takeshi Furuhashi

Kohonen’s Self-Organizing feature Map (SOM) is used to obtain a topology-preserving mapping from a high-dimensional feature space to a visible space of two or fewer dimensions. The SOM algorithm uses a fixed structure of neurons in visible space and learns a dataset by updating reference points in feature space. The mapping result depends on the fixed mapping parameters, namely the number and visible positions of neurons, and on the learning parameters, namely the learning rate, the total number of iterations, and the setting of neighborhood radii. To obtain a satisfactory result, the user usually must try many combinations of parameters. It is wasteful, however, to set up every possible combination of parameters and to rerun the algorithm from the beginning each time, because the computation cost of learning is large, especially for a large-scale dataset. These problems arise from fixing the two types of mapping parameters, i.e., the number and visible positions of neurons. The high computation cost lies mainly in calculating the distances from each sample to all reference points. At the beginning of learning, reference points should be adjusted globally to preserve the topology well, because they are initially set far from their optimal positions in feature space, e.g., randomly. Having many reference points at this stage subdivides feature space into unnecessarily fine Voronoi regions. To avoid this computational waste, it is natural to start learning with a small number of neurons and to increase the number of neurons during learning. We propose a new SOM method that varies the number and visible positions of neurons and is thus also applicable to torus and sphere visible spaces. We apply our proposal to a spherical visible space. We use central Voronoi tessellation to move visible positions for two reasons: to tessellate visible space evenly for easy visualization, and to even out the number of neighboring neurons and so better preserve topology. We demonstrate the effect of generating neurons on computation cost and the effect of moving visible positions on visualization and topology preservation.
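
For orientation, the conventional fixed-grid SOM that this abstract takes as its starting point can be sketched as follows. This is a minimal illustration in Python with NumPy, not the growing method the authors propose. The function names, the linear decay schedules, the Gaussian neighborhood, and the default parameter values are all assumptions chosen for brevity.

```python
import numpy as np

def train_som(data, rows=4, cols=4, n_iter=500, lr0=0.5, radius0=2.0, seed=0):
    """Train a fixed-grid Kohonen SOM; returns (reference_points, grid_positions)."""
    rng = np.random.default_rng(seed)
    # Fixed positions of the neurons in the 2-D visible space.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Reference points in feature space, initialised randomly.
    refs = rng.random((rows * cols, data.shape[1]))
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                    # decaying learning rate (assumed schedule)
        radius = max(radius0 * (1.0 - frac), 0.5)  # shrinking neighbourhood radius (assumed)
        x = data[rng.integers(len(data))]
        # Distance from the sample to every reference point: the dominant cost.
        bmu = np.argmin(np.sum((refs - x) ** 2, axis=1))
        # Gaussian neighbourhood measured in visible space around the winner.
        d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
        h = np.exp(-d2 / (2.0 * radius ** 2))
        refs += lr * h[:, None] * (x - refs)
    return refs, grid

def best_matching_unit(refs, x):
    """Index of the reference point closest to sample x."""
    return int(np.argmin(np.sum((refs - x) ** 2, axis=1)))
```

Each iteration compares one sample with every reference point, so the cost per step grows with the number of neurons. That is the expense the abstract aims to reduce by starting with few neurons and adding more during learning.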


2020 ◽  
Author(s):  
Chen Shi ◽  
Wang Kaicun ◽  
Zhou Chunlüe

Heatwaves are affected by large-scale atmospheric circulation acting on temperature-related climate in the context of global warming. Northern China has recently experienced an increase in heatwaves, which is partly due to atmospheric circulation. This study aims to clarify that influence. Northern China heatwaves are computed from the excess heat factor (EHF), and five EHF indexes are then examined to characterize summer heatwaves in Northern China. Circulation patterns over China are classified into nine typical patterns using a self-organizing map (SOM), and each pattern is described quantitatively by three pattern factors: frequency, persistence, and maximum persistence. Pearson correlation analysis and stepwise regression analysis are applied to explore the impact. Results show that the number of individual heatwave events (HWN) and the duration in days of the longest heatwave (HWD) are high everywhere in Northern China. All EHF indexes rise over the time series (P<0.05), and regional heatwave occurrence has a trend of 0.79 days per year (P<0.05). The pattern factors, however, show no conspicuous trend. Two patterns with significant correlations (P<0.05) are suggestive of the Okhotsk Sea high and the West Pacific Subtropical High. The results indicate that the Okhotsk Sea high, rather than the subtropical high, favors heatwave occurrence in Northern China: the warm center over the Okhotsk Sea transfers heat upward and westward, producing high temperatures and a persistent high-pressure system and causing summer heatwaves in Northern China. The two related circulation patterns explain 38% of heatwave occurrence in the stepwise regression model, with a coefficient of 0.443 for the Okhotsk Sea high and -0.347 for the subtropical high.
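
The three pattern factors can be illustrated with a short Python sketch. The abstract does not define them, so the definitions here are assumptions: frequency is the share of days assigned to a pattern, persistence is the mean length of its consecutive runs, and maximum persistence is its longest run. The function name and the sample label sequence are hypothetical.

```python
from itertools import groupby

def pattern_factors(labels):
    """Per-pattern frequency, mean persistence, and maximum persistence.

    `labels` is a sequence of daily pattern labels (e.g. SOM node indices).
    The three definitions are assumptions, not taken from the study.
    """
    runs = {}
    # groupby yields one group per run of identical consecutive labels.
    for pattern, group in groupby(labels):
        runs.setdefault(pattern, []).append(len(list(group)))
    total = len(labels)
    return {
        p: {
            "frequency": sum(r) / total,      # share of all days
            "persistence": sum(r) / len(r),   # mean run length in days
            "max_persistence": max(r),        # longest run in days
        }
        for p, r in runs.items()
    }

# Hypothetical ten-day sequence of pattern labels.
days = [1, 1, 1, 2, 2, 1, 3, 3, 3, 3]
factors = pattern_factors(days)
```

With this sequence, pattern 1 occurs on four of ten days in two runs of lengths 3 and 1, so its frequency is 0.4, its persistence is 2.0 days, and its maximum persistence is 3 days.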


Kohonen Maps ◽  
1999 ◽  
pp. 375-387 ◽  
Author(s):  
Olli Simula ◽  
Jussi Ahola ◽  
Esa Alhoniemi ◽  
Johan Himberg ◽  
Juha Vesanto

2021 ◽  
Vol 17 (3) ◽  
pp. e1008804
Author(s):  
Hong Seo Lim ◽  
Peng Qiu

With the rapid advances of various single-cell technologies, an increasing number of single-cell datasets are being generated, and computational tools for aligning the datasets, which make subsequent integration or meta-analysis possible, have become critical. Typically, single-cell datasets from different technologies cannot be directly combined or concatenated, due to innate differences in the data, such as the number of measured parameters and the distributions. Even datasets generated by the same technology are often affected by the batch effect. A computational approach for aligning different datasets and hence identifying related clusters will be useful for data integration and interpretation in large-scale single-cell experiments. Our proposed algorithm, called JSOM, a variation of the Self-organizing map, aligns two related datasets that contain similar clusters by constructing two maps (low-dimensional discretized representations of the datasets) that jointly evolve according to both datasets. Here we applied the JSOM algorithm to flow cytometry, mass cytometry, and single-cell RNA sequencing datasets. The resulting JSOM maps not only align the related clusters in the two datasets but also preserve the topology of the datasets so that the maps could be used for further analysis, such as clustering.


2012 ◽  
Vol 2012 ◽  
pp. 1-14 ◽  
Author(s):  
Tonny J. Oyana ◽  
Luke E. K. Achenie ◽  
Joon Heo

The objective of this paper is to introduce an efficient algorithm, namely, the mathematically improved learning-self organizing map (MIL-SOM) algorithm, which speeds up the self-organizing map (SOM) training process. In the proposed MIL-SOM algorithm, the weights of Kohonen’s SOM are based on the proportional-integral-derivative (PID) controller. Thus, in a typical SOM learning setting, this improvement translates to faster convergence. The basic idea is primarily motivated by the urgent need to develop algorithms that converge faster and more efficiently than conventional techniques. The MIL-SOM algorithm is tested on four training geographic datasets representing biomedical and disease informatics application domains. Experimental results show that the MIL-SOM algorithm provides a competitive and improved updating procedure, better performance, and good robustness, and that it runs faster than Kohonen’s SOM.
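
To make the PID idea concrete, the sketch below applies a generic discrete PID correction to a single weight vector that is pulled toward one input sample. It is not the published MIL-SOM update, whose equations are not given in this abstract. The function name `pid_update`, the gain values, and the omission of the SOM neighborhood function are all assumptions made for illustration.

```python
import numpy as np

def pid_update(w, x, integral, prev_error, kp=0.3, ki=0.05, kd=0.05):
    """One PID-flavoured step moving weight vector w toward sample x.

    Hypothetical illustration only; gains kp, ki, kd are assumed values.
    Returns the new weights, the updated error integral, and the current error.
    """
    error = x - w                      # proportional term: current mismatch
    integral = integral + error        # integral term: accumulated mismatch
    derivative = error - prev_error    # derivative term: change in mismatch
    w_new = w + kp * error + ki * integral + kd * derivative
    return w_new, integral, error

# Repeatedly present the same sample and let the weights settle on it.
x = np.array([1.0, -2.0])
w = np.zeros(2)
integral = np.zeros(2)
prev_error = np.zeros(2)
for _ in range(200):
    w, integral, prev_error = pid_update(w, x, integral, prev_error)
```

A plain Kohonen step corresponds to the proportional term alone. The integral and derivative terms add memory of past errors and of their rate of change, which is the general mechanism a PID controller contributes.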


2012 ◽  
Vol 23 ◽  
pp. 394-401 ◽  
Author(s):  
Fengqing Li ◽  
Qinghua Cai ◽  
Xiaodong Qu ◽  
Tao Tang ◽  
Naicheng Wu ◽  
...  

2016 ◽  
Vol 3 (2) ◽  
pp. 160
Author(s):  
Fajar Rohman Hariri ◽  
Danar Putra Pamungkas

Large volumes of stored data are rarely used optimally because of the limited human capacity to manage them. One kind of large-scale data is text data. Text data has a large number of features, so processing it requires a correspondingly large amount of computation time. The clustering process uses the Self Organizing Map method, applying dimensionality reduction at the preprocessing stage. This method is applied to cluster the final-project data of Informatics Engineering students at Universitas Trunojoyo Madura. In the proposed method, morphological analysis is performed on the abstract text of the final projects to produce input vectors whose elements are the terms of each project. The experiments show that the optimum clustering yields an average SSE of 0.01117.
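
The preprocessing and evaluation steps described above can be sketched in Python with NumPy. Several parts are stand-ins, since the abstract does not give the details: whitespace tokenization replaces morphological analysis, a principal-component projection is assumed as the dimensionality reduction, and the SSE is taken to be the mean squared distance from each document to its nearest prototype. All function names are hypothetical.

```python
import numpy as np

def term_matrix(docs):
    """Bag-of-words counts: one row per document, one column per term."""
    # Whitespace splitting is a stand-in for morphological analysis.
    vocab = sorted({t for d in docs for t in d.lower().split()})
    index = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for row, d in enumerate(docs):
        for t in d.lower().split():
            X[row, index[t]] += 1.0
    return X, vocab

def reduce_dims(X, k):
    """Project centred rows onto the top-k principal directions (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

def mean_sse(Z, prototypes):
    """Average squared distance from each row of Z to its nearest prototype."""
    d2 = ((Z[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).mean())
```

In use, the reduced vectors from `reduce_dims` would be the input to SOM training, and the trained reference points would be passed to `mean_sse` as `prototypes` to score a given map size.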

