An HMM-Based PoS Tagger for Old Church Slavonic

Abstract We present a hybrid HMM-based PoS tagger for Old Church Slavonic. The training corpus is a portion of a single text, Codex Marianus (40k), annotated with Universal Dependencies UPOS tags in the UD-PROIEL treebank. We perform a number of experiments in within-domain and out-of-domain settings, in which the remaining part of Codex Marianus serves as the within-domain test set and Kiev Folia serves as the out-of-domain test set. Analysing by-PoS-class precision and sensitivity in each run, we combine a simple context-free n-gram-based approach with a Hidden Markov method (HMM) and add linguistic rules for specific cases such as punctuation and digits. While the model achieves a rather unimpressive accuracy of 81% in the in-domain setting, we observe an accuracy of 51% in out-of-domain evaluation, which is comparable to the results of large neural architectures based on pre-trained contextual embeddings.
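
The abstract names two ingredients, an HMM and hand-written rules for punctuation and digits. The sketch below illustrates how such a combination might look; it is not the authors' implementation. The toy English corpus, the add-one smoothing, and the names `train_hmm`, `rule_tag`, `viterbi` and `tag` are all assumptions made for illustration, and the context-free n-gram component mentioned in the abstract is not modelled.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; the paper trains on Codex Marianus instead.
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"), (".", "PUNCT")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), (".", "PUNCT")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB"), (".", "PUNCT")],
]

def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans, emit, vocab = defaultdict(Counter), defaultdict(Counter), set()
    for sent in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumed smoothing choice)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def rule_tag(word):
    """Rules for tokens that need no statistics: digits and punctuation."""
    if word.isdigit():
        return "NUM"
    if word and all(not ch.isalnum() for ch in word):
        return "PUNCT"
    return None

def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence under a bigram HMM."""
    tags = sorted(emit)
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 slot for unknown words
    best = [{t: (log_prob(trans["<s>"], t, n_tags)
                 + log_prob(emit[t], words[0], n_words), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
        best.append(column)
    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

def tag(words, trans, emit, vocab):
    """HMM decoding with rule-based overrides for digits and punctuation."""
    if not words:
        return []
    path = viterbi(words, trans, emit, vocab)
    return [rule_tag(w) or t for w, t in zip(words, path)]

trans, emit, vocab = train_hmm(TOY_CORPUS)
print(tag(["the", "cat", "barks", "."], trans, emit, vocab))
```

In this sketch the rules are applied after decoding, so a digit token unseen in training still receives `NUM` regardless of what the HMM proposed for it.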

Corpus based learning of stochastic, context-free grammars combined with Hidden Markov Models for tRNA modelling

International Journal of Bioinformatics Research and Applications ◽

10.1504/ijbra.2006.007908 ◽

2006 ◽

Vol 1 (3) ◽

pp. 305

Author(s):

Juan Miguel Garcia-Gomez ◽

Jose Miguel Benedi ◽

Javier Vicente ◽

Montserrat Robles

Keyword(s):

Hidden Markov Models ◽

Markov Models ◽

Hidden Markov ◽

Stochastic Context Free Grammars ◽

Context Free ◽

Context Free Grammars

Off-line isolated handwritten Thai OCR using island-based projection with n-gram model and hidden Markov models

Information Processing & Management ◽

10.1016/j.ipm.2004.04.011 ◽

2005 ◽

Vol 41 (1) ◽

pp. 139-160 ◽

Cited By ~ 14

Author(s):

Thanaruk Theeramunkong ◽

Chainat Wongtapan

Keyword(s):

Hidden Markov Models ◽

Markov Models ◽

Hidden Markov ◽

N Gram

HMMoce: An R package for improved geolocation of archival‐tagged fishes using a hidden Markov method

Methods in Ecology and Evolution ◽

10.1111/2041-210x.12959 ◽

2018 ◽

Vol 9 (5) ◽

pp. 1212-1220 ◽

Cited By ~ 19

Author(s):

Camrin D. Braun ◽

Benjamin Galuardi ◽

Simon R. Thorrold

Keyword(s):

Hidden Markov ◽

R Package ◽

Markov Method

RNA MODELING BY COMBINING STOCHASTIC CONTEXT-FREE GRAMMARS AND n-GRAM MODELS

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001402001691 ◽

2002 ◽

Vol 16 (03) ◽

pp. 309-315 ◽

Cited By ~ 15

Author(s):

ISMAEL SALVADOR ◽

JOSÉ-MIGUEL BENEDÍ

Keyword(s):

Rna Sequences ◽

Global Relation ◽

Stochastic Version ◽

Pairwise Correlations ◽

N Gram ◽

Stochastic Context Free Grammars ◽

Context Free ◽

Context Free Grammars

RNA sequences present structured regions caused by pairwise correlations, and nonstructured regions where no global relation can be found. In this paper, we present a combination of stochastic context-free grammars (SCFGs) and bigram models. The SCFGs are used to represent the long-term relations of the structured part of RNA sequences, while the bigram models are used to capture the local relations of the nonstructured part. A stochastic version of Sakakibara's algorithm is used to study the SCFGs. Finally, experiments were carried out to evaluate the behavior of this proposal.

Hidden-Markov Methods for the Analysis of Single-Molecule Actomyosin Displacement Data: The Variance-Hidden-Markov Method

Biophysical Journal ◽

10.1016/s0006-3495(01)75922-x ◽

2001 ◽

Vol 81 (5) ◽

pp. 2795-2816 ◽

Cited By ~ 45

Author(s):

David A. Smith ◽

Walter Steffen ◽

Robert M. Simmons ◽

John Sleep

Keyword(s):

Single Molecule ◽

Hidden Markov ◽

Markov Method

Applying N-gram Alignment Entropy to Improve Feature Decay Algorithms

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2017-0024 ◽

2017 ◽

Vol 108 (1) ◽

pp. 245-256 ◽

Cited By ~ 3

Author(s):

Alberto Poncelas ◽

Gideon Maillette de Buy Wenniger ◽

Andy Way

Keyword(s):

Machine Translation ◽

English Translation ◽

Data Selection ◽

Test Set ◽

Specific Entropy ◽

Decay Factor ◽

N Gram ◽

Training Examples ◽

Linear Decay

Abstract: Data Selection is a popular step in Machine Translation pipelines. Feature Decay Algorithms (FDA) is a data selection technique that has shown good performance in several tasks. FDA aims to maximize the coverage of n-grams in the test set. However, intuitively, more ambiguous n-grams require more training examples in order to adequately estimate their translation probabilities. This ambiguity can be measured by alignment entropy. In this paper we propose two methods for calculating the alignment entropies for n-grams of any size, which can be used for improving the performance of FDA. We evaluate the substitution of the n-gram-specific entropy values computed by these methods for the parameters of both the exponential and linear decay factors of FDA. The experiments conducted on German-to-English and Czech-to-English translation demonstrate that the use of alignment entropies can lead to an increase in the quality of the results of FDA.
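
The abstract does not spell out its two entropy methods, so the following is only a sketch of the general idea, assuming alignment entropy is the Shannon entropy of the distribution of target phrases aligned to a source n-gram. The function name `alignment_entropy` and the example counts are hypothetical.

```python
import math
from collections import Counter

def alignment_entropy(target_counts):
    """Shannon entropy (in bits) of the aligned-target distribution.

    target_counts maps each target phrase to how often it was aligned
    with one fixed source n-gram.
    """
    total = sum(target_counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total)
        for c in target_counts.values()
        if c > 0
    )

# Hypothetical alignment counts for two source n-grams.
unambiguous = Counter({"Haus": 4})
ambiguous = Counter({"Bank": 2, "Ufer": 2})
print(alignment_entropy(unambiguous), alignment_entropy(ambiguous))
```

Under this reading, an n-gram that always aligns to the same target has entropy zero, while one split evenly between two targets has an entropy of one bit and would call for more training examples.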

LSTM-Based with Deterministic Negative Sampling for API Suggestion

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194019500347 ◽

2019 ◽

Vol 29 (07) ◽

pp. 1029-1051 ◽

Cited By ~ 2

Author(s):

Jinpei Yan ◽

Yong Qi ◽

Qifan Rao ◽

Hui He ◽

Saiyu Qi

Keyword(s):

Short Term Memory ◽

Hidden Markov ◽

Specific Model ◽

Great Effort ◽

Short Term ◽

Term Memory ◽

Computation Efficiency ◽

Long Short Term Memory ◽

N Gram ◽

Lstm Network

Modern programming relies on a large number of fundamental APIs, but programmers often expend great effort remembering the names and usage of APIs when coding, and repeatedly search the related API documents or Q&A websites (e.g. Stack Overflow). To improve programming efficiency, we present a Java API suggestion model called APIHelper, which learns API sequence patterns via a Long Short-Term Memory (LSTM) network and then provides API suggestions based on the program context. Compared with statistical methods (e.g. Hidden Markov Model (HMM), N-gram), which require establishing one specific model for each class, we propose Deterministic Negative Sampling (DNS) to make API suggestions for a large number of Java classes with a single end-to-end LSTM. To verify this approach, we make API suggestions for 50,000 Java classes and evaluate them with accuracy and top-K accuracy. The results show that APIHelper outperforms other research works in both accuracy and computation efficiency.

Use of machine learning to identify relevant research publications in clinical oncology.

Journal of Clinical Oncology ◽

10.1200/jco.2019.37.15_suppl.6558 ◽

2019 ◽

Vol 37 (15_suppl) ◽

pp. 6558-6558

Author(s):

Fernando Jose Suarez Saiz ◽

Corey Sanders ◽

Rick J Stevens ◽

Robert Nielsen ◽

Michael W Britt ◽

...

Keyword(s):

Machine Learning ◽

Expert Knowledge ◽

Grey Literature ◽

Clinical Decision ◽

Training Data ◽

Surgical Approaches ◽

High Quality ◽

Test Set ◽

N Gram ◽

Abstract Content

6558 Background: Finding high-quality science to support decisions for individual patients is challenging. Common approaches to assess clinical literature quality and relevance rely on bibliometrics or expert knowledge. We describe a method to automatically identify clinically relevant, high-quality scientific citations using abstract content. Methods: We used machine learning trained on text from PubMed papers cited in 3 expert resources: NCCN, NCI-PDQ, and Hemonc.org. Balanced training data included text cited in at least two sources to form an “on-topic” set (i.e., relevant and high quality), and an “off-topic” set, not cited in any of the above 3 sources. The off-topic set was published in lower-ranked journals, as determined by a citation-based score. Articles were part of an Oncology Clinical Trial corpus generated using a standard PubMed query. We used a gradient-boosted tree approach for supervised binary logistic classification. Briefly, 988 texts were processed to produce a term frequency-inverse document frequency (tf-idf) n-gram representation of both the training and the test set (70/30 split). Ideal parameters were determined using 1000-fold cross validation. Results: Our model classified papers in the test set with 0.93 accuracy (95% CI (0.09:0.96) p ≤ 0.0001), with sensitivity 0.95 and specificity 0.91. Some false positives contained language considered clinically relevant that may have been missed or not yet included in expert resources. False negatives revealed a potential bias towards chemotherapy-focused research over radiation therapy or surgical approaches. Conclusions: Machine learning can be used to automatically identify relevant clinical publications from bibliographic databases, without relying on expert curation or bibliometric methods. The use of machine learning to identify relevant publications may reduce the time clinicians spend finding pertinent evidence for a patient.
This approach is generalizable to cases where a corpus of high-quality publications that can serve as a training set exists or cases where document metadata is unreliable, as is the case of “grey” literature within oncology and beyond to other diseases. Future work will extend this approach and may integrate it into oncology clinical decision-support tools.
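
The tf-idf n-gram representation mentioned in the Methods can be sketched in plain Python as follows. This is one common tf-idf variant (relative term frequency times log of inverse document frequency), not necessarily the weighting the authors used; the helper names `word_ngrams` and `tfidf` and the two example texts are hypothetical, and the gradient-boosted classifier itself is not shown.

```python
import math
from collections import Counter

def word_ngrams(text, n_max=2):
    """Lower-cased word n-grams for n = 1..n_max."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def tfidf(docs, n_max=2):
    """Return one {n-gram: weight} dictionary per document."""
    grams = [Counter(word_ngrams(d, n_max)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for counts in grams for g in counts)
    n_docs = len(docs)
    vectors = []
    for counts in grams:
        total = sum(counts.values())
        vectors.append({
            g: (k / total) * math.log(n_docs / df[g])
            for g, k in counts.items()
        })
    return vectors

# Hypothetical abstract snippets.
docs = [
    "phase iii trial of chemotherapy",
    "phase iii trial of radiation",
]
vectors = tfidf(docs)
print(vectors[0]["chemotherapy"], vectors[0]["phase"])
```

N-grams shared by every document receive weight zero under this variant, so only the distinguishing terms contribute to the feature vector passed to a classifier.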

STRUCTURAL HIDDEN MARKOV MODELS BASED ON STOCHASTIC CONTEXT-FREE GRAMMARS

Control and Intelligent Systems ◽

10.2316/journal.201.2007.3.201-1665 ◽

2007 ◽

Vol 35 (3) ◽

Cited By ~ 1

Author(s):

D. Bouchaffra ◽

J. Tan

Keyword(s):

Hidden Markov Models ◽

Markov Models ◽

Hidden Markov ◽

Stochastic Context Free Grammars ◽

Context Free ◽

Context Free Grammars

Development of Part of Speech Tagger for Assamese Using HMM

Natural Language Processing ◽

10.4018/978-1-7998-0951-7.ch054 ◽

2020 ◽

pp. 1139-1148

Author(s):

Surjya Kanta Daimary ◽

Vishal Goyal ◽

Madhumita Barbora ◽

Umrinderpal Singh

Keyword(s):

Language Processing ◽

Stochastic Models ◽

Hidden Markov ◽

Stochastic Approach ◽

Point Of View ◽

Part Of Speech ◽

Pos Tagger ◽

Asian Languages ◽

Free Word ◽

Assamese Language

This article presents work on a Part-of-Speech Tagger for Assamese based on the Hidden Markov Model (HMM). Over the years, many language processing tasks have been addressed for Western and South-Asian languages. However, very little work has been done for the Assamese language. With this in view, a POS Tagger for Assamese using a Stochastic Approach is being developed. Assamese is a free word-order, highly agglutinative and morphologically rich language, so developing a POS Tagger with good accuracy will help in the development of other NLP tasks for Assamese. For this work, an annotated corpus of 271,890 words with a BIS tagset consisting of 38 tag labels is used. The model is trained on 256,690 words and the remaining words are used for testing. The system obtained an accuracy of 89.21%, and it is compared with other existing stochastic models.
