Features of Distributional Method for Indonesian Word Clustering

2019 ◽  
Vol 5 (2) ◽  
pp. 164
Author(s):  
Herry Sujaini

We described the results of a study to determine the best features for the Extended Word Similarity Based (EWSB) algorithm. EWSB is a word clustering algorithm that can be used for all languages with a common feature. We provided four alternative features that can be used for word similarity computation and experimented on Indonesian to determine the best feature format for the language. We found that the best feature for Indonesian EWSB is the t w w' format (3-gram) with 0 (zero) word relation. Moreover, we found that using 3-grams is better than using 4-grams for all the proposed features: the average recall with 3-grams is 83.50%, while the average recall with 4-grams is 57.25%.
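
The abstract does not spell out how EWSB builds its features or scores similarity, so the following is only a generic illustration of distributional word similarity, not the EWSB algorithm itself. It assumes that a word is described by its immediate left and right neighbours (a three-word window) and that similarity is the cosine of the resulting count vectors. The function names and the toy Indonesian sentence are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_features(tokens, window=1):
    """Map each word to a Counter of (offset, neighbour) context features."""
    feats = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for off in range(-window, window + 1):
            j = i + off
            if off != 0 and 0 <= j < len(tokens):
                feats[word][(off, tokens[j])] += 1
    return feats

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus (assumed for illustration only).
tokens = "saya makan nasi saya makan roti dia minum teh dia minum kopi".split()
feats = context_features(tokens)

sim_food = cosine(feats["nasi"], feats["roti"])   # both follow "makan"
sim_mixed = cosine(feats["nasi"], feats["teh"])   # no shared context
```

In this toy corpus "nasi" and "roti" share the left neighbour "makan", so they score higher than "nasi" and "teh", which share no context. A clustering step would then group words whose similarity exceeds some threshold.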

2011 ◽  
Vol 58-60 ◽  
pp. 995-1000
Author(s):  
Li Chi Yuan

The category-based statistical language model is an important method for addressing the sparse data problem. However, this model has two bottlenecks: (1) word clustering, since it is hard to find a clustering method that performs well without a large amount of computation; and (2) class-based methods always lose some predictive ability when adapting to text from different domains. This paper attempts to solve both problems. It presents a novel definition of word similarity and, based on it, defines word set similarity. Experiments show that the similarity-based word clustering algorithm is better than the conventional greedy clustering method in both speed and performance. The paper also presents a new method for creating the vari-gram model.


Author(s):  
Herry Sujaini

Extended Word Similarity Based (EWSB) clustering is a word clustering algorithm based on word similarity values computed from a corpus. One benefit of clustering with this algorithm is improved translation quality in statistical machine translation. Previous research showed that the EWSB algorithm could improve an Indonesian-English translator, where the algorithm was applied to Indonesian as the target language. This paper discusses the results of applying the EWSB algorithm to an Indonesian-to-Minang statistical machine translator, where the algorithm is applied to Minang as the target language. The results show that the EWSB algorithm is quite effective when Minang is the target language, improving translation accuracy by 6.36%.


Sensors ◽  
2019 ◽  
Vol 19 (6) ◽  
pp. 1475 ◽  
Author(s):  
Hongjun Wang ◽  
Zhen Yang ◽  
Yingchun Shi

As an emerging class of spatial trajectory data, mobile user trajectory data can be used to analyze individual or group behavioral characteristics, hobbies, and interests. In addition, the information extracted from original trajectory data is widely used in smart cities, transportation planning, and anti-terrorism maintenance. To identify the important locations of a target user from their trajectory data, a novel division method for preprocessing trajectory data is proposed: the feature points of the original trajectory are extracted according to changes in the trajectory structure, and important locations are then extracted by clustering the feature points with an improved density peak clustering algorithm. Finally, to predict the next location of mobile users, a multi-order fusion Markov model based on the Adaboost algorithm is proposed. The model order k is adaptively determined, and the weight coefficients of the 1~k-order models are assigned by the Adaboost algorithm according to the importance of each order's model, producing a multi-order fusion Markov model that predicts the next important location of the user. Experimental results on the real user trajectory dataset Geo-life show that the Adaboost-Markov model predicts better than the multi-order fusion Markov model with equal coefficients, and that its universality and prediction performance are better than those of the first- to third-order Markov models.
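
The idea of fusing Markov models of orders 1 to k with per-order weights can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weights are fixed by hand as a stand-in for the coefficients that Adaboost would learn, and the function names and the toy visit sequence are assumptions.

```python
from collections import Counter, defaultdict

def train_markov(seq, order):
    """Count next-location frequencies for each context of length `order`."""
    table = defaultdict(Counter)
    for i in range(order, len(seq)):
        table[tuple(seq[i - order:i])][seq[i]] += 1
    return table

def predict_fused(seq, history, weights):
    """Combine the next-location distributions of orders 1..k.

    weights[0] applies to the 1st-order model, weights[1] to the 2nd, etc.
    Orders whose context was never observed contribute nothing.
    """
    scores = Counter()
    for order, weight in enumerate(weights, start=1):
        if len(history) < order:
            continue
        counts = train_markov(seq, order).get(tuple(history[-order:]))
        if not counts:
            continue
        total = sum(counts.values())
        for location, count in counts.items():
            scores[location] += weight * count / total
    return scores.most_common(1)[0][0] if scores else None

# Toy sequence of important locations (assumed for illustration only).
visits = ["home", "work", "gym", "home", "work", "cafe", "home", "work", "gym"]
weights = [0.5, 0.3, 0.2]  # hand-picked stand-in for Adaboost coefficients
next_loc = predict_fused(visits, ["home", "work"], weights)
```

Setting all weights equal gives the equal-coefficient fusion baseline that the abstract compares against.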


2019 ◽  
Vol 5 (11) ◽  
pp. 85 ◽  
Author(s):  
Ayan Chatterjee ◽  
Peter W. T. Yuen

This paper proposes a simple yet effective method for improving the efficiency of sparse coding dictionary learning (DL) with an implication of enhancing the ultimate usefulness of compressive sensing (CS) technology for practical applications, such as hyperspectral imaging (HSI) scene reconstruction. CS is the technique which allows sparse signals to be decomposed into a sparse representation “a” of a dictionary D u . The goodness of the learnt dictionary has a direct impact on the quality of the end results, e.g., in HSI scene reconstructions. This paper proposes the construction of a concise and comprehensive dictionary by using the cluster centres of the input dataset, and then a greedy approach is adopted to learn all elements within this dictionary. The proposed method consists of an unsupervised clustering algorithm (K-Means), which is then coupled with an advanced sparse coding dictionary (SCD) method such as the basis pursuit algorithm (orthogonal matching pursuit, OMP) for the dictionary learning. The effectiveness of the proposed K-Means Sparse Coding Dictionary (KMSCD) is illustrated through the reconstructions of several publicly available HSI scenes. The results show that the proposed KMSCD achieves ~40% greater accuracy, converges 5 times faster, and is twice as robust as the classic Sparse Coding Dictionary (C-SCD) method, which adopts random sampling of data for the dictionary learning. Over the five data sets employed in this study, the proposed KMSCD reconstructs these scenes with mean accuracies approximately 20–500% better than all competing algorithms adopted in this work. Furthermore, the reconstruction efficiency of trace materials in the scene has been assessed: the KMSCD recovers them ~12% better than the C-SCD.
These results suggest that the proposed DL, which uses a simple clustering method to construct the dictionary, enhances scene reconstruction substantially. When the proposed KMSCD is combined with fast non-negative orthogonal matching pursuit (FNNOMP) to constrain the maximum number of materials coexisting in a pixel to four, experiments show that it performs approximately ten times better than when constrained by the widely employed TMM algorithm. This suggests that the proposed DL method, using KMSCD together with FNNOMP, may be more suitable as the material allocation module of HSI scene simulators such as the CameoSim package.


2010 ◽  
Vol 439-440 ◽  
pp. 481-485
Author(s):  
Li Xia Liu ◽  
Yi Qi Zhuang

Clustering techniques are often used in Web log mining to analyze users' interest in web pages. Based on an analysis of the advantages and disadvantages of applying classic clustering algorithms to Web log data mining, the paper proposes a hierarchical K-means Web log clustering algorithm, which integrates the K-means clustering algorithm with a cohesion-based hierarchical clustering algorithm and overcomes the high time complexity of hierarchical clustering. The algorithm clusters better than K-means and is suitable for clustering large amounts of data. An analysis of results on practical Web log data also confirms the validity of the algorithm.
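
One common way to combine the two techniques is to run K-means with a generous number of clusters and then merge the resulting clusters agglomeratively, so the expensive pairwise step runs over a handful of centroids rather than every record. The sketch below follows that assumption; the abstract does not give the paper's exact procedure, and the function names, the 1-D data (standing in for per-session feature vectors), and the deterministic initialisation are all hypothetical.

```python
def kmeans_1d(points, k, iters=20):
    """Plain K-means on 1-D values; requires k >= 2.

    Initial centres are evenly spaced picks from the sorted data, which
    keeps the sketch deterministic.
    """
    pts = sorted(points)
    centers = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        # Keep the old centre if a cluster ends up empty.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

def merge_clusters(groups, target):
    """Agglomeratively merge the two clusters with the closest centroids
    until only `target` clusters remain."""
    groups = [list(g) for g in groups if g]
    while len(groups) > target:
        cents = [sum(g) / len(g) for g in groups]
        pairs = ((a, b) for a in range(len(groups))
                 for b in range(a + 1, len(groups)))
        i, j = min(pairs, key=lambda ab: abs(cents[ab[0]] - cents[ab[1]]))
        groups[i].extend(groups.pop(j))  # j > i, so index i stays valid
    return groups

# Toy data (assumed): e.g. pages viewed per session.
data = [1, 2, 3, 10, 11, 12, 50, 51, 52, 60, 61, 62]
fine = kmeans_1d(data, k=4)
final = merge_clusters(fine, target=2)
```

Here K-means first yields four tight groups, and the merge step joins the two low-valued groups and the two high-valued groups into two final clusters.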

