A Scalable Distributed Syntactic, Semantic, and Lexical Language Model

2012, Vol 38 (3), pp. 631-671
Author(s):  
Ming Tan ◽  
Wenli Zhou ◽  
Lei Zheng ◽  
Shaojun Wang

This paper presents an attempt at building a large-scale distributed composite language model, formed by seamlessly integrating an n-gram model, a structured language model, and probabilistic latent semantic analysis under a directed Markov random field paradigm, to simultaneously account for local lexical information, mid-range sentence syntactic structure, and long-span document semantic content. The composite language model has been trained by a convergent N-best list approximate EM algorithm and a follow-up EM algorithm to improve word prediction power, on corpora of up to a billion tokens stored on a supercomputer. The large-scale distributed composite language model yields a drastic perplexity reduction over n-grams and achieves significantly better translation quality, measured by the BLEU score and the “readability” of translations, when applied to re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.
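As a rough illustration only (not the paper's directed Markov random field formulation), re-ranking an N-best list with a composite model can be sketched as a weighted combination of per-candidate component log-scores; the candidate sentences, score triples, and weights below are all hypothetical:

```python
def rerank_nbest(nbest, weights):
    """Re-rank candidates by a weighted sum of component log-scores.

    Each candidate carries one log-score per component model, e.g.
    (n-gram, syntactic, semantic); higher combined score ranks first.
    """
    def combined(cand):
        return sum(w * s for w, s in zip(weights, cand["scores"]))
    return sorted(nbest, key=combined, reverse=True)

# invented N-best list with made-up component log-scores
nbest = [
    {"text": "the cat sat on the mat", "scores": (-4.1, -3.8, -2.9)},
    {"text": "the cat sat at the mat", "scores": (-4.0, -5.2, -3.5)},
    {"text": "cat the sat on mat the", "scores": (-9.7, -8.1, -6.0)},
]
# hypothetical weights for the three component models
best = rerank_nbest(nbest, weights=(0.5, 0.3, 0.2))[0]["text"]
```

The fluent candidate wins because all three components agree; in practice the weights would be tuned on held-out data.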

Author(s):  
Umrinderpal Singh

A language model guides the decoding process in choosing the right word from the several options available in the knowledge base or phrase table. A language model can be generated with the n-gram approach, and various model types and smoothing procedures exist: unigram, bigram, and trigram models, interpolation, the backoff language model, and so on. We have run experiments with different language models in which phrases, rather than words, serve as the smallest unit. The experiments show that a phrase-based language model yields more accurate results than a simple word-based model. We have also run experiments with a machine translation system, using the phrase-based language model in place of the word-based one, and the system shows substantial improvement.
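A minimal word-based bigram model can be sketched as follows. For brevity this uses add-one (Laplace) smoothing rather than the interpolation or backoff schemes mentioned above; a phrase-based variant would simply treat multi-word phrases as atomic tokens. The toy corpus is invented:

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over tokenized, boundary-marked sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def prob(uni, bi, prev, word, vocab_size):
    """Add-one (Laplace) smoothed bigram probability P(word | prev)."""
    return (bi[(prev, word)] + 1) / (uni[prev] + vocab_size)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram(corpus)
V = len(uni)  # vocabulary size, including boundary markers
p = prob(uni, bi, "the", "cat", V)
```

Here `P(cat | the) = (1 + 1) / (2 + 6) = 0.25`; unseen bigrams still receive nonzero probability thanks to the smoothing.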


2012, Vol 38 (1), pp. 135-171
Author(s):  
Hagen Fürstenau ◽  
Mirella Lapata

Large-scale annotated corpora are a prerequisite to developing high-performance semantic role labeling systems. Unfortunately, such corpora are expensive to produce, limited in size, and may not be representative. Our work aims to reduce the annotation effort involved in creating resources for semantic role labeling via semi-supervised learning. The key idea of our approach is to find novel instances for classifier training based on their similarity to manually labeled seed instances. The underlying assumption is that sentences that are similar in their lexical material and syntactic structure are likely to share a frame semantic analysis. We formalize the detection of similar sentences and the projection of role annotations as a graph alignment problem, which we solve exactly using integer linear programming. Experimental results on semantic role labeling show that the automatic annotations produced by our method improve performance over using hand-labeled instances alone.
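The paper's exact integer-linear-programming formulation is out of scope here, but the core idea of scoring injective node alignments between a labeled seed and an unlabeled sentence can be sketched with a brute-force stand-in over toy dependency nodes. The `lemma`/`pos` features and similarity weights below are assumptions for illustration, not the authors' scoring function:

```python
from itertools import permutations

def node_sim(a, b):
    """Toy similarity: 1.0 for identical lemmas, 0.5 for matching POS, else 0."""
    if a["lemma"] == b["lemma"]:
        return 1.0
    return 0.5 if a["pos"] == b["pos"] else 0.0

def best_alignment(seed, target):
    """Exhaustively score all injective mappings of seed nodes onto
    target nodes; a brute-force stand-in for the ILP solver."""
    best, best_score = None, -1.0
    for perm in permutations(range(len(target)), len(seed)):
        score = sum(node_sim(seed[i], target[j]) for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return best, best_score

seed = [{"lemma": "buy", "pos": "V"}, {"lemma": "car", "pos": "N"}]
target = [{"lemma": "purchase", "pos": "V"}, {"lemma": "car", "pos": "N"}]
align, score = best_alignment(seed, target)
```

Once the best alignment is found, the seed's role labels are projected onto the aligned target nodes, yielding a new automatically annotated training instance.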


Author(s):  
Indra Gita Anugrah ◽  
Harunur Rosyid

<p>The rapid development of information technology today is accompanied by equally rapid growth in data. Data are highly valuable information, and their increasingly rapid growth makes them difficult to manage. One use of such data is information retrieval on multimedia video portals. The more multimedia videos are stored in a repository, the more difficult the search process becomes. During search, users sometimes want to see correlations among the results. To establish such correlations, a topic model is needed to link queries, words, and documents drawn from the multimedia video descriptions. One topic-modeling method is the <em>Probabilistic Latent Semantic Analysis (PLSA)</em> model with the <em>Expectation-Maximization (EM) algorithm</em>. The EM algorithm estimates parameters: the first stage computes an expected value (<em>Expectation</em>), which requires topics as initial parameters; these parameter values are then updated in the <em>Maximization</em> step. The initial parameters are constructed with the <em>Naive Bayes</em> algorithm, which predicts future events from prior experience.</p>
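A minimal PLSA trained with EM can be sketched as follows. Unlike the approach described above, this sketch seeds the parameters with random initialization rather than Naive Bayes, and the two tiny "documents" are invented:

```python
import random

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Minimal PLSA trained with EM; counts[d][w] is the count of
    word w in document d."""
    rng = random.Random(seed)
    D, W, Z = len(counts), len(counts[0]), n_topics
    # random initial parameters, normalized into distributions
    p_w_z = [[rng.random() for _ in range(W)] for _ in range(Z)]
    p_z_d = [[rng.random() for _ in range(Z)] for _ in range(D)]
    for row in p_w_z + p_z_d:
        s = sum(row)
        row[:] = [x / s for x in row]
    for _ in range(n_iter):
        # E-step: posterior P(z | d, w), normalized over topics
        post = [[[p_z_d[d][z] * p_w_z[z][w] for z in range(Z)]
                 for w in range(W)] for d in range(D)]
        for d in range(D):
            for w in range(W):
                s = sum(post[d][w]) or 1.0
                post[d][w] = [x / s for x in post[d][w]]
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        for z in range(Z):
            num = [sum(counts[d][w] * post[d][w][z] for d in range(D))
                   for w in range(W)]
            s = sum(num) or 1.0
            p_w_z[z] = [x / s for x in num]
        for d in range(D):
            num = [sum(counts[d][w] * post[d][w][z] for w in range(W))
                   for z in range(Z)]
            s = sum(num) or 1.0
            p_z_d[d] = [x / s for x in num]
    return p_w_z, p_z_d

# two tiny "documents" over a four-word vocabulary
counts = [[5, 4, 0, 0], [0, 0, 4, 5]]
p_w_z, p_z_d = plsa(counts, n_topics=2)
```

After training, `p_z_d[d]` gives each document's topic mixture and `p_w_z[z]` each topic's word distribution, which is what links queries, words, and documents in retrieval.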


Author(s):  
XIAOLONG WANG ◽  
DANIEL S. YEUNG ◽  
JAMES N. K. LIU ◽  
ROBERT LUK ◽  
XUAN WANG

Language modeling is a current research topic in many domains, including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction. There are two main types of language models: the mathematical and the linguistic. The most widely used mathematical language model is the n-gram model inferred from statistics. This model has three problems: the long-distance restriction, its recursive nature, and partial language understanding. Language models based on linguistics present many difficulties when applied to large-scale real texts. We present here a new hybrid language model that combines the advantages of the n-gram statistical language model with those of a linguistic language model that makes use of grammatical or semantic rules. With suitable rules, this hybrid model can solve problems such as the long-distance restriction, the recursive nature, and partial language understanding. The new language model has proved effective in experiments and has been incorporated in Chinese sentence-input products for Windows and Macintosh OS.
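One simple way such a hybrid can be sketched is rule-based filtering followed by n-gram ranking; the bigram probabilities and the single grammar rule below are toy assumptions, not the authors' actual rule set:

```python
def ngram_score(cand, bigram_probs):
    """Product of bigram probabilities (invented numbers for illustration)."""
    p = 1.0
    for a, b in zip(cand, cand[1:]):
        p *= bigram_probs.get((a, b), 0.001)  # small floor for unseen pairs
    return p

def hybrid_pick(candidates, bigram_probs, rules):
    """Filter candidates with linguistic rules, then rank the survivors
    by n-gram probability."""
    valid = [c for c in candidates if all(rule(c) for rule in rules)]
    return max(valid, key=lambda c: ngram_score(c, bigram_probs))

bigram_probs = {("the", "cat"): 0.1, ("cat", "sat"): 0.2,
                ("the", "sat"): 0.05, ("sat", "cat"): 0.01}
# toy linguistic rule: a sentence must contain exactly one verb ("sat")
rules = [lambda c: c.count("sat") == 1]
cands = [["the", "cat", "sat"], ["the", "sat", "cat"], ["the", "cat"]]
best = hybrid_pick(cands, bigram_probs, rules)
```

The rule eliminates the verbless candidate outright, something the n-gram score alone handles poorly, while the statistics decide among the grammatical survivors.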


2001, Vol 24 (3), pp. 305-320
Author(s):  
Benoit Lemaire ◽  
Philippe Dessus

This paper presents Apex, a system that can automatically assess a student essay based on its content. It relies on Latent Semantic Analysis, a technique that represents the meanings of words as vectors in a high-dimensional space. By comparing an essay with the text of a given course on a semantic basis, our system measures how well the essay matches the text. Various assessments are presented to the student regarding the topic, the outline, and the coherence of the essay. Our experiments yield promising results.
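Full LSA factorizes a term-document matrix with SVD before comparing vectors; as a simplified stand-in, comparing an essay against course text can be sketched with cosine similarity over plain bag-of-words vectors (the example texts are invented):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors."""
    shared = set(u) & set(v)
    dot = sum(u[w] * v[w] for w in shared)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

course = Counter("semantic analysis represents word meaning as vectors".split())
essay = Counter("word meaning can be represented as vectors".split())
score = cosine(course, essay)
```

A score near 1 suggests the essay covers the course content; the SVD step in real LSA additionally credits synonyms ("represented" vs. "represents") that this lexical-overlap sketch misses.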

