An HMM-Based PoS Tagger for Old Church Slavonic

2021 ◽  
Vol 72 (2) ◽  
pp. 556-567
Author(s):  
Olga Lyashevskaya ◽  
Ilia Afanasev

Abstract: We present a hybrid HMM-based PoS tagger for Old Church Slavonic. The training corpus is a portion of one text, Codex Marianus (40k), annotated with Universal Dependencies UPOS tags in the UD-PROIEL treebank. We perform a number of experiments in within-domain and out-of-domain settings: the remaining part of Codex Marianus serves as the within-domain test set, and Kiev Folia serves as the out-of-domain test set. Analysing by-PoS-class precision and sensitivity in each run, we combine a simple context-free n-gram-based approach with a Hidden Markov model (HMM) and add linguistic rules for specific cases such as punctuation and digits. While the model achieves a rather unimpressive accuracy of 81% in the in-domain setting, we observe an accuracy of 51% in out-of-domain evaluation, which is comparable to the results of large neural architectures based on pre-trained contextual embeddings.
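The rule layer and the context-free lookup mentioned in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names, the fallback tag `X`, and the toy training pairs are all assumptions made for illustration, and in a hybrid setup a sequence model such as an HMM would presumably handle the words the lookup does not cover.

```python
from collections import Counter, defaultdict
import string

def train_lookup(tagged_sentences):
    """Map each word form to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_token(word, lookup, fallback="X"):
    """Apply hand-written rules first, then the context-free lookup."""
    # Rule 1: tokens made up only of punctuation characters.
    if word and all(ch in string.punctuation for ch in word):
        return "PUNCT"
    # Rule 2: tokens made up only of digits.
    if word.isdigit():
        return "NUM"
    # Otherwise use the most frequent tag seen in training, if any.
    return lookup.get(word, fallback)

# Toy, made-up training pairs (placeholders, not text from the treebank).
train = [
    [("i", "CCONJ"), ("reche", "VERB"), ("imu", "PRON"), (".", "PUNCT")],
    [("reche", "VERB"), ("zhe", "ADV"), ("isusu", "PROPN")],
]
lookup = train_lookup(train)
tags = [tag_token(w, lookup) for w in ["i", "reche", "12", ":", "slovo"]]
```

Here the unseen word `slovo` receives the hypothetical fallback tag, which marks the point where a context-aware model would take over.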

2018 ◽  
Vol 9 (5) ◽  
pp. 1212-1220 ◽  
Author(s):  
Camrin D. Braun ◽  
Benjamin Galuardi ◽  
Simon R. Thorrold

Author(s):  
ISMAEL SALVADOR ◽  
JOSÉ-MIGUEL BENEDÍ

RNA sequences present structured regions caused by pairwise correlations, and nonstructured regions where no global relation can be found. In this paper, we present a combination of stochastic context-free grammars (SCFG) and bigram models. The SCFGs are used to represent the long-term relations of the structured part of RNA sequences, while the bigram models are used to capture the local relations of the nonstructured part. A stochastic version of Sakakibara's algorithm is used to learn the SCFGs. Finally, experiments to evaluate the behavior of this proposal were carried out.
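To make the bigram half of this combination concrete, the following is a minimal sketch of a bigram model over the RNA alphabet. The add-alpha smoothing, the function names, and the two training strings are assumptions for illustration; the abstract does not specify how the bigram probabilities were estimated, and the SCFG part is not shown.

```python
import math
from collections import Counter

ALPHABET = "ACGU"

def train_bigram(sequences, alpha=1.0):
    """Estimate P(next | prev) with add-alpha smoothing (an assumed choice)."""
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1

    def prob(prev, nxt):
        numerator = pair_counts[(prev, nxt)] + alpha
        denominator = prev_counts[prev] + alpha * len(ALPHABET)
        return numerator / denominator

    return prob

def log_prob(seq, prob):
    """Log-probability of a sequence given its first symbol."""
    return sum(math.log(prob(p, n)) for p, n in zip(seq, seq[1:]))

# Made-up sequences, used only to exercise the code.
prob = train_bigram(["GGAUCC", "GAUUCA"])
total = sum(prob("A", n) for n in ALPHABET)  # a proper distribution sums to 1
score = log_prob("GAU", prob)
```

A model like this only sees adjacent symbols, which is why it suits the nonstructured regions and cannot capture the long-range base pairings that the grammar is meant to represent.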


2001 ◽  
Vol 81 (5) ◽  
pp. 2795-2816 ◽  
Author(s):  
David A. Smith ◽  
Walter Steffen ◽  
Robert M. Simmons ◽  
John Sleep

2017 ◽  
Vol 108 (1) ◽  
pp. 245-256 ◽  
Author(s):  
Alberto Poncelas ◽  
Gideon Maillette de Buy Wenniger ◽  
Andy Way

Abstract: Data selection is a popular step in Machine Translation pipelines. Feature Decay Algorithms (FDA) is a data selection technique that has shown good performance in several tasks. FDA aims to maximize the coverage of n-grams in the test set. However, intuitively, more ambiguous n-grams require more training examples in order to adequately estimate their translation probabilities. This ambiguity can be measured by alignment entropy. In this paper we propose two methods for calculating the alignment entropies for n-grams of any size, which can be used for improving the performance of FDA. We evaluate the substitution of the n-gram-specific entropy values computed by these methods for the parameters of both the exponential and linear decay factors of FDA. The experiments conducted on German-to-English and Czech-to-English translation demonstrate that the use of alignment entropies can lead to an increase in the quality of the results of FDA.
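As a rough illustration of how ambiguity can be quantified, the sketch below computes the Shannon entropy of the distribution of target-side translations aligned to one source n-gram. This is the generic entropy formula, not necessarily either of the two methods the paper proposes, and the alignment counts are invented for the example.

```python
import math

def alignment_entropy(target_counts):
    """Entropy (in bits) of the aligned-translation distribution of one n-gram.

    target_counts maps each aligned target phrase to how often it was
    observed; these counts are hypothetical inputs, e.g. from a word aligner.
    """
    total = sum(target_counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in target_counts.values()
        if count > 0
    )

# An n-gram that is always translated the same way carries no ambiguity.
unambiguous = alignment_entropy({"house": 10})
# An n-gram split evenly between two translations has one bit of entropy.
ambiguous = alignment_entropy({"bank": 5, "bench": 5})
```

Under the intuition stated in the abstract, an n-gram with higher entropy would warrant more selected training examples than one with entropy near zero.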


Author(s):  
Jinpei Yan ◽  
Yong Qi ◽  
Qifan Rao ◽  
Hui He ◽  
Saiyu Qi

Modern programming relies on a large number of fundamental APIs, but programmers often expend great effort remembering the names and usage of APIs when coding, and repeatedly search the related API documents or Q&A websites (e.g. Stack Overflow). To improve programming efficiency, we present a Java API suggestion model called APIHelper, which learns API sequence patterns via a Long Short-Term Memory (LSTM) network and then provides API suggestions based on the program context. Compared with statistical methods (e.g. Hidden Markov Model (HMM), N-gram), which require establishing one specific model for each class, we propose Deterministic Negative Sampling (DNS) to make API suggestions for a large number of Java classes with a single end-to-end LSTM. To verify this approach, we make API suggestions for 50,000 Java classes and evaluate them with accuracy and top-K accuracy. The results show that APIHelper outperforms other research works in both accuracy and computational efficiency.
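The top-K accuracy metric named here can be sketched in a few lines. The ranked suggestion lists and gold labels below are made up for illustration and do not come from the paper; only the metric itself is standard.

```python
def top_k_accuracy(ranked_predictions, gold, k):
    """Fraction of cases whose gold label appears in the top k suggestions."""
    hits = sum(
        1
        for ranked, target in zip(ranked_predictions, gold)
        if target in ranked[:k]
    )
    return hits / len(gold)

# Hypothetical ranked suggestions for three code contexts.
ranked = [
    ["String.length", "String.trim", "String.isEmpty"],
    ["List.add", "List.size", "List.get"],
    ["Map.get", "Map.put", "Map.remove"],
]
gold = ["String.trim", "List.add", "Map.containsKey"]

top1 = top_k_accuracy(ranked, gold, 1)  # plain accuracy: only the first guess counts
top3 = top_k_accuracy(ranked, gold, 3)
```

With k = 1 the metric reduces to ordinary accuracy, so reporting both values shows how often the correct API is near the top of the list even when it is not ranked first.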


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. 6558-6558
Author(s):  
Fernando Jose Suarez Saiz ◽  
Corey Sanders ◽  
Rick J Stevens ◽  
Robert Nielsen ◽  
Michael W Britt ◽  
...  

Background: Finding high-quality science to support decisions for individual patients is challenging. Common approaches to assess clinical literature quality and relevance rely on bibliometrics or expert knowledge. We describe a method to automatically identify clinically relevant, high-quality scientific citations using abstract content. Methods: We used machine learning trained on text from PubMed papers cited in 3 expert resources: NCCN, NCI-PDQ, and Hemonc.org. Balanced training data included text cited in at least two sources to form an “on-topic” set (i.e., relevant and high quality), and an “off-topic” set, not cited in any of the above 3 sources. The off-topic set was published in lower ranked journals, using a citation-based score. Articles were part of an Oncology Clinical Trial corpus generated using a standard PubMed query. We used a gradient boosted-tree approach with a binary logistic supervised learning classification. Briefly, 988 texts were processed to produce a term frequency-inverse document frequency (tf-idf) n-gram representation of both the training and the test set (70/30 split). Ideal parameters were determined using 1000-fold cross validation. Results: Our model classified papers in the test set with 0.93 accuracy (95% CI (0.09:0.96) p ≤ 0.0001), with sensitivity 0.95 and specificity 0.91. Some false positives contained language considered clinically relevant that may have been missed or not yet included in expert resources. False negatives revealed a potential bias towards chemotherapy-focused research over radiation therapy or surgical approaches. Conclusions: Machine learning can be used to automatically identify relevant clinical publications from bibliographic databases, without relying on expert curation or bibliometric methods. The use of machine learning to identify relevant publications may reduce the time clinicians spend finding pertinent evidence for a patient. This approach is generalizable to cases where a corpus of high-quality publications that can serve as a training set exists, or to cases where document metadata is unreliable, as is the case with “grey” literature, both within oncology and in other diseases. Future work will extend this approach and may integrate it into oncology clinical decision-support tools.
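The tf-idf n-gram representation described in the Methods can be illustrated with a small standard-library sketch. The weighting variant (relative term frequency times log(N/df)), the choice of unigrams plus bigrams, and the two example texts are assumptions; the abstract does not state which variant or n-gram range was used, and the gradient-boosted classifier is not shown.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """All word n-grams of length 1..n_max, joined with spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def tfidf(documents, n_max=2):
    """One {n-gram: weight} dict per document, using tf * log(N / df)."""
    grams = [Counter(ngrams(doc.lower().split(), n_max)) for doc in documents]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(gram for counts in grams for gram in counts)
    n_docs = len(documents)
    return [
        {
            gram: (count / sum(counts.values())) * math.log(n_docs / df[gram])
            for gram, count in counts.items()
        }
        for counts in grams
    ]

# Invented snippets standing in for abstract text.
docs = [
    "randomized trial of chemotherapy",
    "randomized trial of radiation therapy",
]
vectors = tfidf(docs)
```

In this toy case the n-grams shared by both texts receive weight zero, while `chemotherapy` and `radiation therapy` keep positive weights, which is the property that lets a downstream classifier focus on distinguishing vocabulary.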


2020 ◽  
pp. 1139-1148
Author(s):  
Surjya Kanta Daimary ◽  
Vishal Goyal ◽  
Madhumita Barbora ◽  
Umrinderpal Singh

This article presents work on a Part-of-Speech tagger for Assamese based on the Hidden Markov Model (HMM). Over the years, many language processing tasks have been addressed for Western and South-Asian languages. However, very little work has been done for the Assamese language. With this in view, a POS tagger for Assamese using a stochastic approach is being developed. Assamese is a free word-order, highly agglutinative and morphologically rich language, so developing a POS tagger with good accuracy will help in the development of other NLP tasks for Assamese. For this work, an annotated corpus of 271,890 words with a BIS tagset consisting of 38 tag labels is used. The model is trained on 256,690 words and the remaining words are used in testing. The system obtained an accuracy of 89.21%, and it is compared with other existing stochastic models.
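For readers unfamiliar with HMM tagging, the sketch below shows a generic bigram HMM tagger with Viterbi decoding. It is not the system described in the article: the add-one smoothing, the start symbol `<s>`, the function names, and the English toy corpus are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def log_p(counter, key, n_outcomes, alpha=1.0):
    """Add-alpha smoothed log-probability of key under counter."""
    total = sum(counter.values())
    return math.log((counter[key] + alpha) / (total + alpha * n_outcomes))

def viterbi(words, trans, emit):
    """Return the most probable tag sequence for words."""
    if not words:
        return []
    tags = list(emit)
    # Vocabulary size, plus one slot reserved for unseen words.
    n_words = len({w for c in emit.values() for w in c}) + 1
    best = {
        t: (log_p(trans["<s>"], t, len(tags))
            + log_p(emit[t], words[0], n_words), [t])
        for t in tags
    }
    for word in words[1:]:
        new_best = {}
        for t in tags:
            # Best previous tag for reaching tag t at this position.
            score, path = max(
                ((s + log_p(trans[p], t, len(tags)), prev_path)
                 for p, (s, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[t] = (score + log_p(emit[t], word, n_words), path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Tiny English toy corpus, used only to exercise the code.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
trans, emit = train_hmm(train)
predicted = viterbi(["the", "cat", "barks"], trans, emit)
```

A real tagger for a morphologically rich language would likely need better handling of unseen word forms than the single smoothing slot used here, for example suffix-based emission estimates.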

