A Scalable Distributed Syntactic, Semantic, and Lexical Language Model

2012, Vol 38 (3), pp. 631-671
Author(s):  
Ming Tan ◽  
Wenli Zhou ◽  
Lei Zheng ◽  
Shaojun Wang

This paper presents an attempt at building a large-scale distributed composite language model, formed by seamlessly integrating an n-gram model, a structured language model, and probabilistic latent semantic analysis under a directed Markov random field paradigm, to simultaneously account for local lexical information, mid-range sentence syntactic structure, and long-span document semantic content. The composite language model has been trained by a convergent N-best list approximate EM algorithm and a follow-up EM algorithm to improve word prediction power, on corpora of up to a billion tokens stored on a supercomputer. The large-scale distributed composite language model yields a drastic perplexity reduction over n-grams and achieves significantly better translation quality, measured by the BLEU score and the “readability” of translations, when applied to re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
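As a rough illustration only (not the paper's directed Markov random field formulation), re-ranking an N-best list with a composite model can be sketched as a weighted combination of per-candidate component log-scores; the candidate sentences, score triples, and weights below are all hypothetical:

```python
def rerank_nbest(nbest, weights):
    """Re-rank candidates by a weighted sum of component log-scores.

    Each candidate carries one log-score per component model, e.g.
    (n-gram, syntactic, semantic); higher combined score ranks first.
    """
    def combined(cand):
        return sum(w * s for w, s in zip(weights, cand["scores"]))
    return sorted(nbest, key=combined, reverse=True)

# invented N-best list with made-up component log-scores
nbest = [
    {"text": "the cat sat on the mat", "scores": (-4.1, -3.8, -2.9)},
    {"text": "the cat sat at the mat", "scores": (-4.0, -5.2, -3.5)},
    {"text": "cat the sat on mat the", "scores": (-9.7, -8.1, -6.0)},
]
# hypothetical weights for the three component models
best = rerank_nbest(nbest, weights=(0.5, 0.3, 0.2))[0]["text"]
```

The fluent candidate wins because all three components agree; in practice the weights would be tuned on held-out data.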

Author(s):  
Umrinderpal Singh

A language model guides the decoding process in choosing the right word from the several options available in the knowledge base or phrase table. A language model can be generated with the n-gram approach, and various model types and smoothing procedures exist: unigram, bigram, and trigram models, interpolation, the backoff language model, and so on. We have run experiments with different language models in which phrases, rather than words, serve as the smallest unit. The experiments show that a phrase-based language model yields more accurate results than a simple word-based model. We have also run experiments with a machine translation system, using the phrase-based language model in place of the word-based one, and the system shows substantial improvement.
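A minimal word-based bigram model can be sketched as follows. For brevity this uses add-one (Laplace) smoothing rather than the interpolation or backoff schemes mentioned above; a phrase-based variant would simply treat multi-word phrases as atomic tokens. The toy corpus is invented:

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over tokenized, boundary-marked sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def prob(uni, bi, prev, word, vocab_size):
    """Add-one (Laplace) smoothed bigram probability P(word | prev)."""
    return (bi[(prev, word)] + 1) / (uni[prev] + vocab_size)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram(corpus)
V = len(uni)  # vocabulary size, including boundary markers
p = prob(uni, bi, "the", "cat", V)
```

Here `P(cat | the) = (1 + 1) / (2 + 6) = 0.25`; unseen bigrams still receive nonzero probability thanks to the smoothing.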


2012, Vol 38 (1), pp. 135-171
Author(s):  
Hagen Fürstenau ◽  
Mirella Lapata

Large-scale annotated corpora are a prerequisite to developing high-performance semantic role labeling systems. Unfortunately, such corpora are expensive to produce, limited in size, and may not be representative. Our work aims to reduce the annotation effort involved in creating resources for semantic role labeling via semi-supervised learning. The key idea of our approach is to find novel instances for classifier training based on their similarity to manually labeled seed instances. The underlying assumption is that sentences that are similar in their lexical material and syntactic structure are likely to share a frame semantic analysis. We formalize the detection of similar sentences and the projection of role annotations as a graph alignment problem, which we solve exactly using integer linear programming. Experimental results on semantic role labeling show that the automatic annotations produced by our method improve performance over using hand-labeled instances alone.
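The paper's exact integer-linear-programming formulation is out of scope here, but the core idea of scoring injective node alignments between a labeled seed and an unlabeled sentence can be sketched with a brute-force stand-in over toy dependency nodes. The `lemma`/`pos` features and similarity weights below are assumptions for illustration, not the authors' scoring function:

```python
from itertools import permutations

def node_sim(a, b):
    """Toy similarity: 1.0 for identical lemmas, 0.5 for matching POS, else 0."""
    if a["lemma"] == b["lemma"]:
        return 1.0
    return 0.5 if a["pos"] == b["pos"] else 0.0

def best_alignment(seed, target):
    """Exhaustively score all injective mappings of seed nodes onto
    target nodes; a brute-force stand-in for the ILP solver."""
    best, best_score = None, -1.0
    for perm in permutations(range(len(target)), len(seed)):
        score = sum(node_sim(seed[i], target[j]) for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return best, best_score

seed = [{"lemma": "buy", "pos": "V"}, {"lemma": "car", "pos": "N"}]
target = [{"lemma": "purchase", "pos": "V"}, {"lemma": "car", "pos": "N"}]
align, score = best_alignment(seed, target)
```

Once the best alignment is found, the seed's role labels are projected onto the aligned target nodes, yielding a new automatically annotated training instance.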


Author(s):  
Indra Gita Anugrah ◽  
Harunur Rosyid

<p>The rapid development of information technology today is accompanied by equally rapid growth in data. Data are highly valuable information, and their increasingly rapid growth makes them difficult to manage. One use of such data is information retrieval on multimedia video portals. The more multimedia videos are stored in a repository, the more difficult the search process becomes. During search, users sometimes want to see correlations among the results. To establish such correlations, a topic model is needed to link queries, words, and documents drawn from the multimedia video descriptions. One topic-modeling method is the <em>Probabilistic Latent Semantic Analysis (PLSA)</em> model with the <em>Expectation-Maximization (EM) algorithm</em>. The EM algorithm estimates parameters: the first stage computes an expected value (<em>Expectation</em>), which requires topics as initial parameters; these parameter values are then updated in the <em>Maximization</em> step. The initial parameters are constructed with the <em>Naive Bayes</em> algorithm, which predicts future events from prior experience.</p>
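A minimal PLSA trained with EM can be sketched as follows. Unlike the approach described above, this sketch seeds the parameters with random initialization rather than Naive Bayes, and the two tiny "documents" are invented:

```python
import random

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Minimal PLSA trained with EM; counts[d][w] is the count of
    word w in document d."""
    rng = random.Random(seed)
    D, W, Z = len(counts), len(counts[0]), n_topics
    # random initial parameters, normalized into distributions
    p_w_z = [[rng.random() for _ in range(W)] for _ in range(Z)]
    p_z_d = [[rng.random() for _ in range(Z)] for _ in range(D)]
    for row in p_w_z + p_z_d:
        s = sum(row)
        row[:] = [x / s for x in row]
    for _ in range(n_iter):
        # E-step: posterior P(z | d, w), normalized over topics
        post = [[[p_z_d[d][z] * p_w_z[z][w] for z in range(Z)]
                 for w in range(W)] for d in range(D)]
        for d in range(D):
            for w in range(W):
                s = sum(post[d][w]) or 1.0
                post[d][w] = [x / s for x in post[d][w]]
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        for z in range(Z):
            num = [sum(counts[d][w] * post[d][w][z] for d in range(D))
                   for w in range(W)]
            s = sum(num) or 1.0
            p_w_z[z] = [x / s for x in num]
        for d in range(D):
            num = [sum(counts[d][w] * post[d][w][z] for w in range(W))
                   for z in range(Z)]
            s = sum(num) or 1.0
            p_z_d[d] = [x / s for x in num]
    return p_w_z, p_z_d

# two tiny "documents" over a four-word vocabulary
counts = [[5, 4, 0, 0], [0, 0, 4, 5]]
p_w_z, p_z_d = plsa(counts, n_topics=2)
```

After training, `p_z_d[d]` gives each document's topic mixture and `p_w_z[z]` each topic's word distribution, which is what links queries, words, and documents in retrieval.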


Author(s):  
XIAOLONG WANG ◽  
DANIEL S. YEUNG ◽  
JAMES N. K. LIU ◽  
ROBERT LUK ◽  
XUAN WANG

Language modeling is a current research topic in many domains, including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction. There are two main types of language models: the mathematical and the linguistic. The most widely used mathematical language model is the n-gram model inferred from statistics. This model has three problems: the long-distance restriction, its recursive nature, and partial language understanding. Language models based on linguistics present many difficulties when applied to large-scale real texts. We present here a new hybrid language model that combines the advantages of the n-gram statistical language model with those of a linguistic language model that makes use of grammatical or semantic rules. With suitable rules, this hybrid model can solve problems such as the long-distance restriction, the recursive nature, and partial language understanding. The new language model has proved effective in experiments and has been incorporated in Chinese sentence-input products for Windows and Macintosh OS.
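One simple way such a hybrid can be sketched is rule-based filtering followed by n-gram ranking; the bigram probabilities and the single grammar rule below are toy assumptions, not the authors' actual rule set:

```python
def ngram_score(cand, bigram_probs):
    """Product of bigram probabilities (invented numbers for illustration)."""
    p = 1.0
    for a, b in zip(cand, cand[1:]):
        p *= bigram_probs.get((a, b), 0.001)  # small floor for unseen pairs
    return p

def hybrid_pick(candidates, bigram_probs, rules):
    """Filter candidates with linguistic rules, then rank the survivors
    by n-gram probability."""
    valid = [c for c in candidates if all(rule(c) for rule in rules)]
    return max(valid, key=lambda c: ngram_score(c, bigram_probs))

bigram_probs = {("the", "cat"): 0.1, ("cat", "sat"): 0.2,
                ("the", "sat"): 0.05, ("sat", "cat"): 0.01}
# toy linguistic rule: a sentence must contain exactly one verb ("sat")
rules = [lambda c: c.count("sat") == 1]
cands = [["the", "cat", "sat"], ["the", "sat", "cat"], ["the", "cat"]]
best = hybrid_pick(cands, bigram_probs, rules)
```

The rule eliminates the verbless candidate outright, something the n-gram score alone handles poorly, while the statistics decide among the grammatical survivors.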


2001, Vol 24 (3), pp. 305-320
Author(s):  
Benoit Lemaire ◽  
Philippe Dessus

This paper presents Apex, a system that can automatically assess a student essay based on its content. It relies on Latent Semantic Analysis, a technique that represents the meanings of words as vectors in a high-dimensional space. By comparing an essay with the text of a given course on a semantic basis, our system measures how well the essay matches the text. Various assessments are presented to the student regarding the topic, the outline, and the coherence of the essay. Our experiments yield promising results.
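Full LSA factorizes a term-document matrix with SVD before comparing vectors; as a simplified stand-in, comparing an essay against course text can be sketched with cosine similarity over plain bag-of-words vectors (the example texts are invented):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors."""
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

course = Counter("semantic analysis represents word meaning as vectors".split())
essay = Counter("word meaning can be represented as vectors".split())
score = cosine(course, essay)
```

A score near 1 suggests the essay covers the course content; the SVD step in real LSA additionally credits synonyms ("represented" vs. "represents") that this lexical-overlap sketch misses.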

