Author(s):  
Herry Sujaini

Extended Word Similarity Based (EWSB) Clustering is a word clustering algorithm based on word similarity values computed from a corpus. One benefit of clustering with this algorithm is improved translation quality in statistical machine translation. Previous research showed that the EWSB algorithm could improve an Indonesian-English translator, where the algorithm was applied to Indonesian as the target language. This paper discusses the results of applying the EWSB algorithm to an Indonesian-to-Minang statistical machine translator, where the algorithm is applied to Minang as the target language. The results show that the EWSB algorithm is quite effective when used with Minang as the target language: it improves translation accuracy by 6.36%.


2011 ◽  
Vol 58-60 ◽  
pp. 995-1000
Author(s):  
Li Chi Yuan

The category-based statistical language model is an important method for addressing the problem of sparse data. But this model has two bottlenecks: (1) word clustering, since it is hard to find a clustering method that performs well without a large amount of computation; and (2) class-based methods always lose some predictive ability when adapting to text from different domains. The authors try to solve these problems in this paper. The paper presents a novel definition of word similarity and, based on it, a definition of word set similarity. Experiments show that the similarity-based word clustering algorithm is better than the conventional greedy clustering method in both speed and performance. The paper also presents a new method for creating the vari-gram model.


2019 ◽  
Vol 5 (2) ◽  
pp. 164
Author(s):  
Herry Sujaini

We describe the results of a study to determine the best features for the EWSB (Extended Word Similarity Based) algorithm. EWSB is a word clustering algorithm that can be used for all languages with a common feature. We provided four alternative features that can be used for word similarity computation and experimented on Indonesian to determine the best feature format for the language. We found that the best feature for EWSB on Indonesian is the t w w' format (3-gram) with 0 (zero) word relation. Moreover, we found that using 3-grams is better than using 4-grams for all the proposed features. The average recall for 3-grams is 83.50%, while the average recall for 4-grams is 57.25%.


Author(s):  
Mohana Priya K ◽  
Pooja Ragavi S ◽  
Krishna Priya G

Clustering is the process of grouping objects into subsets that have meaning in the context of a particular problem. It does not rely on predefined classes. It is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed and are used in different applications. Sentence clustering is one of the best clustering techniques. A hierarchical clustering algorithm is applied at multiple levels for accuracy. A POS tagger and the Porter stemmer are used for tagging. The WordNet dictionary is used to determine similarity by invoking the Jiang-Conrath and cosine similarity measures. Grouping is performed with respect to the highest similarity value, using a mean threshold. This paper incorporates many parameters for finding the similarity between words. To disambiguate words, sense identification is performed for adjectives and the results are compared. SemCor and machine learning datasets are employed. Compared with previous results for WSD, our work shows a considerable improvement, reaching 91.2%.
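
The cosine similarity measure and mean threshold mentioned in this abstract can be illustrated with a minimal sketch. It assumes plain numeric vectors rather than WordNet senses, and the names `cosine_similarity`, `pairs_above_mean`, and `toy_vectors` are hypothetical; it is not the authors' pipeline, only one way to read "grouping with a mean threshold".

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # treat a zero vector as dissimilar

def pairs_above_mean(vectors):
    """Return the pairs whose similarity exceeds the mean pairwise similarity."""
    sims = {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }
    threshold = sum(sims.values()) / len(sims)  # assumed meaning of "mean threshold"
    return [pair for pair, s in sims.items() if s > threshold], threshold

# Toy data (assumption): "a" and "b" point the same way, "c" is orthogonal.
toy_vectors = {"a": [1.0, 0.0], "b": [2.0, 0.0], "c": [0.0, 3.0]}
grouped, threshold = pairs_above_mean(toy_vectors)
```

With this toy data only the pair `("a", "b")` exceeds the mean similarity, so it is the only pair grouped together.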


2017 ◽  
Vol 5 (12) ◽  
pp. 323-325
Author(s):  
E. Mahima Jane ◽  
E. George Dharma Prakash Raj

2015 ◽  
pp. 125-138
Author(s):  
I. V. Goncharenko

In this article we propose a new method of non-hierarchical cluster analysis using a k-nearest-neighbor graph and discuss it with respect to vegetation classification. The k-nearest neighbor (k-NN) classification method was originally developed in 1951 (Fix, Hodges, 1951). Later the term “k-NN graph” and a few k-NN clustering algorithms appeared (Cover, Hart, 1967; Brito et al., 1997). In biology, k-NN is used in the analysis of protein structures and genome sequences. Most k-NN clustering algorithms first build an “excessive” graph, a so-called hypergraph, and then truncate it into subgraphs by partitioning and coarsening the hypergraph. We developed a different strategy, “upward” clustering, which forms (assembles sequentially) one cluster after another. Until now, graph-based cluster analysis has not been considered for the classification of vegetation datasets.
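
The k-NN graph referred to in this abstract can be sketched as follows. This is a minimal illustration that assumes Euclidean distance on small numeric vectors; the function name `knn_graph` and the sample points are hypothetical, and the sketch covers only graph construction, not the authors' “upward” clustering strategy.

```python
import math

def knn_graph(points, k):
    """Map each point index to the indices of its k nearest other points."""
    graph = {}
    for i, p in enumerate(points):
        # Euclidean distance from point i to every other point, with that point's index.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()  # nearest first; ties are broken by index
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Toy data (assumption): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
graph = knn_graph(points, k=1)
```

With `k=1`, each point in this toy data links to the other member of its own pair, so the graph already splits into two connected components.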


2012 ◽  
Vol 12 ◽  
Author(s):  
Amanda Post Silveira

This is a preliminary study in which we investigate the acquisition of English as a second language (L2[1]) word stress by native speakers of Brazilian Portuguese (BP, L1[2]). In this paper, we show the results of a multiple-choice forced-choice perception test in which native speakers of American English and native speakers of Dutch judged the production of English words bearing pre-final stress that were either cognates or non-cognates of BP words. The tokens were produced by native speakers of American English and by Brazilians who speak English as a second language. The results show that American and Dutch listeners were consistent in their judgments of native and non-native stress productions and that both speaker groups produced variation in stress relative to the canonical pattern. However, the variability found in American English points to the prosodic patterns of English, while the variability found in Brazilian English points to the stress patterns of Portuguese. This occurs especially in words whose forms activate similar neighboring words in the L1. Transfer from the L1 appears at both the segmental and prosodic levels in BP English. [1] L2 stands for second language, foreign language, target language. [2] L1 stands for first language, mother tongue, source language.

