Statistical Language Model
Recently Published Documents

TOTAL DOCUMENTS: 55 (five years: 9)
H-INDEX: 8 (five years: 1)

Author(s): Xianwen Liao, Yongzhong Huang, Peng Yang, Lei Chen

In this article, by defining a computable word-segmentation unit and studying its probability characteristics, we establish an unsupervised statistical language model (SLM) for a new pre-trained sequence labeling framework. The proposed SLM is an optimization model whose objective is to maximize the total binding force of all candidate word-segmentation units in a sentence, without annotated datasets or vocabularies. To solve the SLM, we design a recursive divide-and-conquer dynamic programming algorithm. By integrating the SLM with popular sequence labeling models, we perform experiments on Vietnamese word segmentation, part-of-speech tagging, and named entity recognition. The experimental results show that our SLM effectively improves the performance of sequence labeling tasks. Using less than 10% of the training data and no dictionary, our sequence labeling framework outperforms the state-of-the-art Vietnamese word segmentation toolkit VnCoreNLP in the cross-dataset test. The SLM has no hyperparameters to tune, is completely unsupervised, and is applicable to any other analytic language, giving it good domain adaptability.
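The abstract's recursive divide-and-conquer dynamic program can be sketched generically: given some score for each candidate unit (a hypothetical stand-in for the paper's "binding force", whose exact definition is not given here), memoized recursion finds the segmentation maximizing the total score.

```python
from functools import lru_cache

def segment(sentence, score, max_len=4):
    """Return the segmentation of `sentence` that maximizes the
    total unit score. `score(unit)` is a hypothetical stand-in
    for the paper's "binding force" of a candidate unit."""
    n = len(sentence)

    @lru_cache(maxsize=None)
    def best(i):
        # best(i) = (max total score, best segmentation) of sentence[i:]
        if i == n:
            return (0.0, ())
        candidates = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            unit = sentence[i:j]
            tail_score, tail_seg = best(j)
            candidates.append((score(unit) + tail_score, (unit,) + tail_seg))
        return max(candidates)

    return list(best(0)[1])

# Toy scoring table: favor units with high (made-up) scores.
freq = {"ab": 3.0, "cd": 2.0, "a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5}
print(segment("abcd", lambda u: freq.get(u, -1.0)))  # ['ab', 'cd']
```

The memoized recursion is the divide-and-conquer form of the usual left-to-right segmentation DP; the two are equivalent, and both run in O(n · max_len) score evaluations.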


Author(s): Ahmed Hussain Aliwy, Basheer Al-Sadawi

Optical character recognition (OCR) is the process of converting text-document images into editable, searchable text. OCR poses particular challenges for the Arabic language, where it produces a high percentage of errors. In this paper, a method is proposed to improve the output of Arabic OCR (AOCR) systems, based on a statistical language model built from large available corpora. The method detects and corrects both non-word and real-word errors according to the context of each word in the sentence. The results show an improvement of up to 98% in the accuracy of the AOCR output.
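Context-based correction of OCR errors with a statistical language model is commonly framed as: generate candidate words near the OCR token, then pick the candidate that best fits its context under the LM. A minimal sketch, assuming a tiny corpus and a bigram model (the paper's actual model and corpora are much larger):

```python
from collections import Counter

# Hypothetical toy corpus standing in for the paper's large corpora.
corpus = "the quick brown fox jumps over the lazy dog the dog sleeps"
words = corpus.split()
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit (delete/replace/insert) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + replaces + inserts)

def correct(prev, token):
    """Pick the in-vocabulary candidate that best fits the left context."""
    if token in unigrams:
        return token  # non-word check; real-word errors need a wider search
    candidates = [c for c in edits1(token) if c in unigrams]
    if not candidates:
        return token
    # Prefer candidates seen after `prev`; break ties by unigram frequency.
    return max(candidates,
               key=lambda c: (bigrams.get((prev, c), 0), unigrams[c]))

# "dg" is a non-word OCR error; the context "lazy" selects "dog".
print(correct("lazy", "dg"))  # dog
```

Real-word errors (valid words that are wrong in context) require scoring every word of the sentence against the LM rather than only out-of-vocabulary tokens, which is the harder half of the problem the abstract mentions.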


2020, Vol 10 (21), pp. 7519
Author(s): Chunli Xie, Xia Wang, Cheng Qian, Mengqi Wang

Finding similar code snippets is a fundamental task in software engineering. Several approaches address it with statistical language models, which focus on the syntax and structure of code rather than the deeper semantic information underlying it. In this paper, a Siamese neural network is proposed that maps code snippets into a continuous vector space and tries to capture their semantic meaning. First, an unsupervised pre-training step models each code snippet as a weighted series of word vectors, with the weights fitted by term frequency-inverse document frequency (TF-IDF). A Siamese neural network is then trained to learn semantic vector representations of code snippets. Finally, cosine similarity measures the similarity score between pairs of code snippets. The approach is implemented on a dataset of functionally similar code, and the experimental results show that it improves performance over a single word-embedding method.
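The pre-training step described here, TF-IDF-weighted averaging of word vectors followed by cosine similarity, can be sketched with toy 2-d embeddings (the paper's trained embeddings and Siamese network are not reproduced, only the weighting scheme):

```python
import math
from collections import Counter

# Toy 2-d word vectors standing in for trained token embeddings.
vecs = {"open": [1.0, 0.0], "file": [0.0, 1.0],
        "read": [0.2, 0.9], "close": [0.9, 0.2]}

# Toy "code snippets" as token lists; df = document frequency per token.
docs = [["open", "file"], ["read", "file"], ["close", "file"]]
df = Counter(w for d in docs for w in set(d))
N = len(docs)

def embed(tokens):
    """TF-IDF-weighted sum of word vectors for one code snippet."""
    tf = Counter(tokens)
    out = [0.0, 0.0]
    for w, f in tf.items():
        idf = math.log((N + 1) / (df[w] + 1)) + 1  # smoothed IDF
        for k in range(2):
            out[k] += f * idf * vecs[w][k]
    return out

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Common tokens like "file" appear everywhere, so IDF down-weights them;
# the rarer tokens dominate the similarity score.
sim = cosine(embed(["open", "file"]), embed(["read", "file"]))
print(round(sim, 3))
```

IDF down-weighting matters for code because boilerplate tokens (keywords, common identifiers) appear in nearly every snippet and would otherwise dominate the averaged vector.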


2020, Vol 12 (2), pp. 39
Author(s): Zichun Su, Jialin Jiang

Event prediction plays an important role in financial risk assessment and disaster warning, supporting government decision-making and economic investment. Previous work predicts events mainly from time series, using statistical language models or recurrent neural networks, while ignoring the influence of prior knowledge on event prediction; as a result, the predicted direction of an event is often biased or wrong. In this paper, we propose a hierarchical event prediction model based on both time series and prior knowledge. To ensure prediction accuracy, the model obtains time-based event information and prior knowledge of events through a Gated Recurrent Unit and an Associated Link Network, respectively. A semantic selective attention mechanism fuses the time-based event information with the prior knowledge and finally generates the predicted events. Experimental results on Chinese news datasets show that our model significantly outperforms state-of-the-art methods, increasing accuracy by 2.8%.
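The fusion step, attention-weighting the time-based representation against the prior-knowledge representation, can be sketched as a softmax over dot-product scores. Everything below is a hypothetical minimal illustration: the vectors, the query, and the plain dot-product scoring are assumptions, not the paper's architecture.

```python
import math

def attention_fuse(h_time, h_prior, query):
    """Selective-attention sketch: softmax-weighted combination of the
    time-based representation and the prior-knowledge representation."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(query, h_time), dot(query, h_prior)]
    m = max(scores)                              # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    w_time, w_prior = (e / z for e in exps)
    return [w_time * t + w_prior * p for t, p in zip(h_time, h_prior)]

# Hypothetical 3-d states: h_time from a GRU over the event sequence,
# h_prior from an Associated Link Network over prior knowledge.
h_time, h_prior = [1.0, 0.0, 0.5], [0.0, 1.0, 0.5]
fused = attention_fuse(h_time, h_prior, query=[1.0, 1.0, 0.0])
print([round(x, 3) for x in fused])  # [0.5, 0.5, 0.5]
```

When the query scores both sources equally (as here), the fusion is a plain average; a query that aligns better with one source shifts the weight toward it, which is the "selective" behavior the abstract describes.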

