A top-down information theoretic word clustering algorithm for phrase recognition

2014 ◽  
Vol 275 ◽  
pp. 213-225 ◽  
Author(s):  
Yu-Chieh Wu
Author(s):  
Herry Sujaini

Extended Word Similarity Based (EWSB) Clustering is a word clustering algorithm based on word similarity values computed from a corpus. One of the benefits of clustering with this algorithm is improved translation quality in statistical machine translation. Previous research showed that the EWSB algorithm could improve an Indonesian-English translator, where the algorithm was applied to Indonesian as the target language. This paper discusses the results of research applying the EWSB algorithm to an Indonesian-to-Minang statistical machine translator, where the algorithm is applied to Minang as the target language. The results show that the EWSB algorithm is quite effective when Minang is the target language, and that it can improve translation accuracy by 6.36%.
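The abstract does not reproduce EWSB's exact procedure, so the following Python sketch only illustrates the general idea of similarity-based word clustering: words are represented by co-occurrence vectors computed from a corpus and then clustered greedily by cosine similarity. The function names, window size, and similarity threshold are all hypothetical, not taken from the paper.

```python
from collections import defaultdict
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    """Build a co-occurrence count vector for each word in the corpus."""
    vectors = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][sent[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    num = sum(u[k] * v.get(k, 0) for k in u)
    den = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def similarity_clusters(sentences, threshold=0.5):
    """Greedy similarity clustering: attach each word to the first cluster
    whose representative is similar enough, else start a new cluster."""
    vectors = cooccurrence_vectors(sentences)
    clusters = []  # list of (representative_word, member_list)
    for w in vectors:
        for rep, members in clusters:
            if cosine(vectors[w], vectors[rep]) >= threshold:
                members.append(w)
                break
        else:
            clusters.append((w, [w]))
    return [members for _, members in clusters]
```

In an SMT pipeline, cluster identifiers like these would replace or supplement target-language words so that sparse word statistics are shared across a cluster.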


2021 ◽  
Vol 8 (10) ◽  
pp. 43-50
Author(s):  
Truong et al.

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to choose attributes with fewer values and leaf nodes with more objects, which can lead to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on real data sets from the UCI repository show that the IMMR algorithm outperforms MMR in clustering categorical data.
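The core quantity in MMR is the rough-set roughness of one attribute's equivalence classes with respect to another attribute's partition. As a hedged sketch (the splitting and stopping details of MMR and IMMR are not in the abstract, and all function names here are hypothetical), the attribute-selection step can be written as:

```python
def partition(rows, attr):
    """Group row indices into equivalence classes by one attribute's value."""
    classes = {}
    for idx, row in enumerate(rows):
        classes.setdefault(row[attr], set()).add(idx)
    return list(classes.values())

def roughness(X, classes):
    """Rough-set roughness of a set X w.r.t. a partition:
    1 - |lower approximation| / |upper approximation|."""
    lower = set().union(*[c for c in classes if c <= X])
    upper = set().union(*[c for c in classes if c & X])
    return 1.0 - len(lower) / len(upper)

def mean_roughness(rows, a, b):
    """Mean roughness of attribute a's classes w.r.t. attribute b's partition."""
    classes_b = partition(rows, b)
    vals = [roughness(X, classes_b) for X in partition(rows, a)]
    return sum(vals) / len(vals)

def mmr_attribute(rows, attrs):
    """Min-Min Roughness selection: for each attribute take the minimum
    mean roughness over all other attributes, then pick the attribute
    achieving the overall minimum."""
    best, best_attr = float("inf"), None
    for a in attrs:
        m = min(mean_roughness(rows, a, b) for b in attrs if b != a)
        if m < best:
            best, best_attr = m, a
    return best_attr
```

Lower roughness means the attribute's classes are more crisply determined by the other attributes, which is why MMR splits on the minimum.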


2011 ◽  
Vol 2 (1) ◽  
pp. 42-48 ◽  
Author(s):  
P. C. W. Davies

Genes store heritable information, but actual gene expression often depends on many so-called epigenetic factors, both physical and chemical, external to DNA. Epigenetic changes can be both reversible and heritable. The genome is associated with a physical object (DNA) with a specific location, whereas the epigenome is a global, systemic entity. Furthermore, genomic information is tied to specific coded molecular sequences stored in DNA. Although epigenomic information can be associated with certain non-DNA molecular sequences, it mostly is not. Therefore, there does not seem to be a stored ‘epigenetic programme’ in the information-theoretic sense. Instead, epigenomic control is—to a large extent—an emergent self-organizing phenomenon, and the real-time operation of the epigenetic ‘project’ lies in the realm of nonlinear bifurcations, interlocking feedback loops, distributed networks, top-down causation and other concepts familiar from complex systems theory. Lying at the heart of vital eukaryotic processes are chromatin structure, organization and dynamics. Epigenetics provides striking examples of how bottom-up genetic and top-down epigenetic causation intermingle. The fundamental question then arises of how causal efficacy should be attributed to biological information. A proposal is made to implement explicit downward causation by coupling information directly to the dynamics of chromatin, thus permitting the coevolution of dynamical laws and states, and opening up a new sector of dynamical systems theory that promises to display rich self-organizing and self-complexifying behaviour.


2019 ◽  
Vol 28 (1) ◽  
pp. 15-30 ◽  
Author(s):  
Rakesh Patra ◽  
Sujan Kumar Saha

Abstract In this paper, we present a novel word clustering technique to capture contextual similarity among words. Related word clustering techniques in the literature rely on word statistics collected from a fixed and small word window. For example, the Brown clustering algorithm is based on bigram statistics of the words. However, in sequential labeling tasks such as named entity recognition (NER), longer-context words also carry valuable information. To capture this longer-context information, we propose a new word clustering algorithm that uses parse information of the sentences and a nonfixed word window. This proposed clustering algorithm, named variable window clustering, performs better than Brown clustering in our experiments. Additionally, to use two different clustering techniques simultaneously in a classifier, we propose a cluster merging technique that performs an output-level merging of two sets of clusters. To test the effectiveness of the approaches, we use two different NER data sets, namely, Hindi and BioCreative II Gene Mention Recognition. A baseline NER system is developed using a conditional random fields classifier, and then the clusters from the individual techniques as well as the merged technique are incorporated to improve the classifier. Experimental results demonstrate that the cluster merging technique is quite promising.
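The abstract does not specify the exact merging criterion, but one natural reading of "output-level merging of two sets of clusters" is a partition intersection: two words end up in the same merged cluster only if both clusterings agree on them. The sketch below assumes that interpretation; the function name and input format (word-to-cluster-id maps) are hypothetical.

```python
def merge_clusterings(c1, c2):
    """Output-level merge of two clusterings, each given as a
    word -> cluster-id map: two words share a merged cluster only if
    they co-occur in the same cluster under BOTH clusterings
    (i.e. the intersection of the two partitions)."""
    merged = {}
    for word in c1.keys() & c2.keys():          # words clustered by both
        merged.setdefault((c1[word], c2[word]), []).append(word)
    return list(merged.values())
```

The merged cluster identifiers (the `(id1, id2)` pairs) could then be fed to a CRF as features, alongside or instead of the two original cluster features.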


2014 ◽  
Vol 26 (9) ◽  
pp. 2074-2101 ◽  
Author(s):  
Hideitsu Hino ◽  
Noboru Murata

Clustering is a representative unsupervised learning task and one of the important approaches in exploratory data analysis. By its very nature, clustering without strong assumptions on the data distribution is desirable. Information-theoretic clustering is a class of clustering methods that optimize information-theoretic quantities such as entropy and mutual information. These quantities can be estimated in a nonparametric manner, and information-theoretic clustering algorithms are capable of capturing various intrinsic data structures. It is also possible to estimate information-theoretic quantities using a data set with a sampling weight for each datum. Assuming the data set is sampled from a certain cluster and assigning different sampling weights depending on the clusters, the cluster-conditional information-theoretic quantities are estimated. In this letter, a simple iterative clustering algorithm is proposed based on a nonparametric estimator of the log likelihood for weighted data sets. The clustering algorithm is also derived from the principle of conditional entropy minimization with maximum entropy regularization. The proposed algorithm does not contain a tuning parameter. The algorithm is experimentally shown to be comparable to, or to outperform, conventional nonparametric clustering methods.
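The letter's actual estimator is not given in the abstract, so the following is only a minimal sketch of the general scheme it describes: soft cluster weights are attached to each datum, a weighted leave-one-out Gaussian KDE estimates each cluster-conditional log likelihood, and assignments are iterated toward higher likelihood (which lowers the conditional entropy of the assignment). The bandwidth parameter and all names here are assumptions; the proposed algorithm itself is stated to be tuning-parameter free.

```python
import numpy as np

def kde_loglik(X, weights, bandwidth):
    """Log likelihood of each point under a weighted Gaussian KDE,
    leaving each point out of its own density estimate."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bandwidth ** 2))
    np.fill_diagonal(K, 0.0)                  # leave-one-out
    dens = K @ weights / max(weights.sum(), 1e-12)
    return np.log(dens + 1e-12)

def info_clustering(X, n_clusters=2, bandwidth=0.5, iters=50, seed=0):
    """Iterative sketch: update soft assignments toward the cluster under
    which each point's weighted-KDE log likelihood (plus log prior) is
    highest, then read off hard labels."""
    rng = np.random.default_rng(seed)
    W = rng.dirichlet(np.ones(n_clusters), size=len(X))   # soft weights
    for _ in range(iters):
        logp = np.stack([kde_loglik(X, W[:, c], bandwidth)
                         for c in range(n_clusters)], axis=1)
        logp += np.log(W.mean(0) + 1e-12)                 # cluster priors
        W_new = np.exp(logp - logp.max(1, keepdims=True))
        W_new /= W_new.sum(1, keepdims=True)
        if np.allclose(W_new, W):
            break
        W = W_new
    return W.argmax(1)
```

The maximum-entropy regularization in the letter is what keeps such soft assignments from collapsing to a single cluster; this sketch omits it and so is only indicative of the iteration structure.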

