Chinese POS Disambiguation and Unknown Word Guessing with Lexicalized HMMs

2009 ◽  
pp. 1595-1607
Author(s):  
Guohong Fu ◽  
Kang-Kwong Luke

This article presents a lexicalized HMM-based approach to Chinese part-of-speech (POS) disambiguation and unknown word guessing (UWG). In order to explore word-internal morphological features for Chinese POS tagging, four types of pattern tags are defined to indicate the way lexicon words are used in a segmented sentence. These patterns are further combined with POS tags, so Chinese POS disambiguation and UWG can be unified as a single task of assigning a proper hybrid tag to each word in the input. A uniformly lexicalized HMM-based tagger is then developed to perform this task; it incorporates both internal word-formation patterns and surrounding contextual information for Chinese POS tagging under the framework of HMMs. Experiments on the Peking University Corpus indicate that the proposed approach improves tagging precision efficiently.
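
To make the tagging setup concrete, here is a minimal sketch of first-order HMM Viterbi decoding over hybrid (pattern + POS) tags. It is not the authors' implementation: the hybrid tag names, example words, and all probabilities are toy placeholders.

```python
# Minimal sketch of first-order HMM Viterbi decoding over hybrid tags
# (word-formation pattern + POS). All tags and probabilities are toy values.
from collections import defaultdict

HYBRID_TAGS = ["WHOLE-n", "WHOLE-v", "PREFIX-n", "SUFFIX-n"]

trans = defaultdict(lambda: 1e-6, {            # P(tag_i | tag_{i-1})
    ("<s>", "WHOLE-n"): 0.5, ("<s>", "WHOLE-v"): 0.3,
    ("WHOLE-n", "WHOLE-v"): 0.4, ("WHOLE-v", "WHOLE-n"): 0.4,
    ("WHOLE-n", "WHOLE-n"): 0.3, ("WHOLE-v", "WHOLE-v"): 0.2,
})
emit = defaultdict(lambda: 1e-6, {             # lexicalized P(word | tag)
    ("WHOLE-n", "学生"): 0.01, ("WHOLE-v", "学习"): 0.02,
    ("WHOLE-n", "汉语"): 0.01,
})

def viterbi(words):
    """Return the most probable hybrid-tag sequence for a segmented sentence."""
    # paths maps the last tag of a partial path to (probability, tag sequence)
    paths = {"<s>": (1.0, [])}
    for w in words:
        new_paths = {}
        for t in HYBRID_TAGS:
            prob, prev = max(
                (p * trans[(pt, t)] * emit[(t, w)], pt)
                for pt, (p, _) in paths.items()
            )
            new_paths[t] = (prob, paths[prev][1] + [t])
        paths = new_paths
    return max(paths.values())[1]

print(viterbi(["学生", "学习", "汉语"]))   # e.g. ['WHOLE-n', 'WHOLE-v', 'WHOLE-n']
```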

The article examines the essence of the Ancient Greek adjective as a separate part of speech. Its substantive nature is considered, including the historical process of its separation into an independent part of speech, with emphasis on the fact that adjectives and nouns cannot be distinguished by external signs in Ancient Greek. The Greek adjectives are analyzed on the grounds of their semantics, morphological features, and syntactic functions. The semantic analysis rests on the concepts of categorial, word-building, and lexical meaning; the categorial meaning of the adjective is attribution. The smaller semantic-grammatical groups (qualitative, relative, and possessive adjectives) are studied with regard to word formation and lexical motivation. Word-building and lexical meanings are examined on the basis of the division of adjectives into primary units and derivatives. The meaning of a derivative is interpreted both through the analysis of its structure (with special attention to compound units, which are mainly formed on the basis of word combinations) and through the relation (strong, weak, or metaphorical) of the general meaning of the derivative to the meanings of its components. The word-formation meaning of such units is therefore syntagmatic, and their lexical semantics also depend on the context. The basic morphological categories of gender, number, and case of a Greek adjective simultaneously indicate its semantic dependence on a noun. The category of degrees of comparison is analyzed in terms of morphological means and such syntactic features as left/right-side valence. The primary (attributive) and secondary (predicative) syntactic functions of the adjective are realized equally in preposition and postposition to the noun in Ancient Greek.


Author(s):  
Guozhe Jin ◽  
Zhezhou Yu

Part-of-speech (POS) tagging is a fundamental task in natural language processing. Korean POS tagging consists of two subtasks: morphological analysis and POS tagging. In recent years, scholars have tended to use the seq2seq model to solve this problem. The full context of a sentence is considered in these seq2seq-based Korean POS tagging methods. However, Korean morphological analysis relies more on local contextual information, and in many cases there exists a one-to-one match between a morpheme's surface form and its base form. To make better use of these characteristics, we propose a hierarchical seq2seq model. In our model, the low-level Bi-LSTM encodes the syllable sequence, whereas the high-level Bi-LSTM models the context information of the whole sentence, and the decoder generates the morpheme base form syllables as well as the POS tags. To improve the accuracy of morpheme base form recovery, we introduce a convolution layer and an attention mechanism into our model. The experimental results on the Sejong corpus show that our model outperforms strong baseline systems in both morpheme-level F1-score and eojeol-level accuracy, achieving state-of-the-art performance.
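
As an illustration of the hierarchy described above, the following is a rough PyTorch sketch: a low-level Bi-LSTM encodes each eojeol's syllables, a high-level Bi-LSTM contextualizes the eojeol vectors over the sentence, and an LSTM decoder emits output symbols per eojeol. The paper's convolution layer and attention mechanism are omitted, and all layer sizes, vocabularies, and the teacher-forced decoding loop are assumptions for demonstration only.

```python
# Rough sketch of a hierarchical seq2seq tagger (convolution and attention
# from the paper are omitted). Sizes and vocabularies are illustrative.
import torch
import torch.nn as nn

class HierarchicalTagger(nn.Module):
    def __init__(self, syl_vocab, out_vocab, emb=64, hid=128):
        super().__init__()
        self.syl_embed = nn.Embedding(syl_vocab, emb)
        self.out_embed = nn.Embedding(out_vocab, emb)
        # low level: Bi-LSTM over the syllables of one eojeol (word unit)
        self.low = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        # high level: Bi-LSTM over the sentence as a sequence of eojeol vectors
        self.high = nn.LSTM(2 * hid, hid, batch_first=True, bidirectional=True)
        # decoder: generates base-form syllables and POS-tag symbols per eojeol
        self.decoder = nn.LSTM(emb + 2 * hid, 2 * hid, batch_first=True)
        self.proj = nn.Linear(2 * hid, out_vocab)

    def forward(self, syllables, targets):
        # syllables: (batch, n_eojeol, n_syl); targets: (batch, n_eojeol, n_out)
        b, ne, ns = syllables.shape
        _, (h, _) = self.low(self.syl_embed(syllables.view(b * ne, ns)))
        eojeol_vec = h.transpose(0, 1).reshape(b, ne, -1)   # (b, ne, 2*hid)
        ctx, _ = self.high(eojeol_vec)                      # sentence context
        n_out = targets.size(-1)
        dec_in = torch.cat(                                 # teacher forcing
            [self.out_embed(targets.view(b * ne, n_out)),
             ctx.reshape(b * ne, 1, -1).expand(-1, n_out, -1)], dim=-1)
        dec_out, _ = self.decoder(dec_in)
        return self.proj(dec_out)                           # output-symbol logits

model = HierarchicalTagger(syl_vocab=2000, out_vocab=500)
syl = torch.randint(0, 2000, (2, 5, 7))   # 2 sentences, 5 eojeols, 7 syllables
tgt = torch.randint(0, 500, (2, 5, 9))    # gold output symbols per eojeol
print(model(syl, tgt).shape)              # torch.Size([10, 9, 500])
```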


Electronics ◽  
2021 ◽  
Vol 11 (1) ◽  
pp. 56
Author(s):  
Hongwei Li ◽  
Hongyan Mao ◽  
Jingzi Wang

Part-of-Speech (POS) tagging is one of the most important tasks in the field of natural language processing (NLP). The POS tag of a word depends not only on the word itself but also on its position, its surrounding words, and their POS tags. POS tagging can be an upstream task for other NLP tasks, further improving their performance. Therefore, it is important to improve the accuracy of POS tagging. In POS tagging, bidirectional Long Short-Term Memory (Bi-LSTM) is commonly used and achieves good performance. However, Bi-LSTM is not as powerful as the Transformer in leveraging contextual information, since Bi-LSTM simply concatenates the contextual information from left-to-right and right-to-left passes. In this study, we propose a novel approach for POS tagging to improve its accuracy. For each token, all possible POS tags are obtained without considering context, and then rules are applied to prune this set of candidate POS tags, a step we call rule-based data preprocessing. In this way, the number of possible POS tags for most tokens can be reduced to one, and these tokens are considered to be correctly tagged. Finally, the POS tags of the remaining tokens are masked, and a Transformer-based model is used to predict only the masked POS tags, which enables it to leverage bidirectional contexts. Our experimental results show that our approach leads to better performance than other methods using Bi-LSTM.
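
A minimal sketch of the pruning-then-masking idea follows. The tag dictionary and the single pruning rule are invented for illustration; the paper's actual rules and tag inventory are not reproduced here. Tokens whose candidate set shrinks to one are accepted as tagged, and the rest would be masked for a Transformer to predict.

```python
# Toy sketch of rule-based pruning of candidate POS tags before masking.
TAG_DICT = {
    "the": {"DT"}, "dog": {"NN", "VB"}, "barks": {"VBZ", "NNS"},
    "loudly": {"RB"},
}

def prune(tokens):
    """Return per-token candidate tags; toy rule: after a determiner,
    drop verb readings of the following token."""
    cands = [set(TAG_DICT.get(t, {"UNK"})) for t in tokens]
    for i in range(1, len(tokens)):
        if cands[i - 1] == {"DT"}:
            cands[i] = {c for c in cands[i] if not c.startswith("VB")} or cands[i]
    return cands

tokens = ["the", "dog", "barks", "loudly"]
cands = prune(tokens)
# Tokens reduced to a single candidate are taken as tagged; the rest are
# masked and left for the Transformer to predict from bidirectional context.
masked = [t if len(c) == 1 else "[MASK]" for t, c in zip(tokens, cands)]
print(list(zip(tokens, cands)), masked)
```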


Author(s):  
Nindian Puspa Dewi ◽  
Ubaidi Ubaidi

POS tagging is the foundation for developing text processing for a language. In this study, we examine the influence of lexicon use and word-morphology changes on determining the proper tag for a word. Rules based on word morphology, such as prefixes, suffixes, and infixes, are commonly called lexical rules. This study applies lexical rules produced by a learner using the Brill Tagger algorithm. Madurese is a regional language spoken on Madura Island and several other islands in East Java. The object of this study is Madurese, which has far more affixation variants than Indonesian. In this study, the lexicon is used not only for looking up Madurese root words but also as one of the stages of POS tag assignment. Experiments using the lexicon reached an accuracy of 86.61%, whereas without the lexicon the accuracy was only 28.95%. From this it can be concluded that the lexicon strongly influences POS tagging.
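
To illustrate the two-stage use of the lexicon and lexical rules described above, here is a toy sketch: lexicon lookup first, then affix-based lexical rules for out-of-lexicon words. The lexicon entries, affixes, and tags are invented placeholders, not rules learned by the Brill Tagger in the study.

```python
# Toy sketch: lexicon lookup, then affix-based (lexical-rule) tagging.
LEXICON = {"wordA": "NN", "wordB": "VB"}      # hypothetical root-word lexicon

LEXICAL_RULES = [                             # (affix test, tag to assign)
    (lambda w: w.startswith("a"), "VB"),      # placeholder prefix rule
    (lambda w: w.endswith("an"), "NN"),       # placeholder suffix rule
]

def tag(word, default="NN"):
    if word in LEXICON:                       # stage 1: lexicon lookup
        return LEXICON[word]
    for test, t in LEXICAL_RULES:             # stage 2: lexical rules
        if test(word):
            return t
    return default

print([(w, tag(w)) for w in ["wordA", "aword", "wordan", "other"]])
```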


2020 ◽  
Vol 9 (4(73)) ◽  
pp. 29-33
Author(s):  
N.S. Bagdaryyn

The article continues the author's research on the toponymy of the North-East of the Sakha Republic, in particular the Kolyma river basin, in the aspect of the interaction of related and unrelated languages. The relevance of this work lies in the description of local geographical terminology of Yukagir origin as valuable and important material for the further study of the toponymy of the region. For the first time, the toponymy of the Kolyma river basin becomes the object of sampling and linguistic analysis of toponyms containing local geographical terms of Yukagir origin. The research was carried out using the comparative method together with word-formation, structural, lexical, and semantic analysis. As a result, phonetic and morphological features are identified, the formation of local geographical terms and geographical names of Yukagir origin is outlined, and previously unrecorded semantic shifts and dialectisms are revealed. The most active element in the formation of terms and toponyms is the geographical term iилil / eҕal 'coast', which is explained by the Yukagirs' conception of the coast as home and dwelling.


Electronics ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 1372
Author(s):  
Sanjanasri JP ◽  
Vijay Krishna Menon ◽  
Soman KP ◽  
Rajendran S ◽  
Agnieszka Wolk

Linguists have been focused on a qualitative comparison of the semantics from different languages. Evaluation of the semantic interpretation among disparate language pairs like English and Tamil is an even more formidable task than for Slavic languages. The concept of word embedding in Natural Language Processing (NLP) has enabled a felicitous opportunity to quantify linguistic semantics. Multi-lingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised for the generation of embeddings to assess their effectiveness, using the original embeddings as ground truths. Transferability across other target languages of the proposed model was assessed via pre-trained Word2Vec embeddings from Hindi and Chinese languages. We empirically prove that with a bilingual dictionary of a thousand words and a corresponding small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of generated target embeddings in a few NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that those are not the only possible applications.
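
As a simplified illustration of deducing a transfer function between embedding spaces from a bilingual dictionary, the sketch below fits a plain linear map by least squares on random stand-in vectors. The paper itself uses data-efficient deep learning models and real Word2Vec/GloVe/FastText embeddings; the dimensions and the linear form here are assumptions for demonstration only.

```python
# Sketch: learn a linear map from source (English) to target (Tamil) embedding
# space from dictionary word pairs. Vectors here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
dim_src, dim_tgt, n_pairs = 300, 300, 1000

X_en = rng.normal(size=(n_pairs, dim_src))   # English vectors of dictionary words
Y_ta = rng.normal(size=(n_pairs, dim_tgt))   # corresponding Tamil vectors

# Fit W minimizing ||X_en @ W - Y_ta||^2 (ordinary least squares).
W, *_ = np.linalg.lstsq(X_en, Y_ta, rcond=None)

def project(en_vec):
    """Map an English embedding into the Tamil embedding space."""
    return en_vec @ W

print(project(X_en[0]).shape)   # (300,)
```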


Electronics ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 1686
Author(s):  
Nancy Agarwal ◽  
Mudasir Ahmad Wani ◽  
Patrick Bours

This work focuses on designing a grammar detection system that understands both structural and contextual information of sentences in order to validate whether English sentences are grammatically correct. Most existing systems model a grammar detector by translating sentences into sequences of either the words appearing in them or syntactic tags holding the grammatical knowledge of the sentences. In this paper, we show that both of these sequencing approaches have limitations. The former model is over-specific, whereas the latter is over-generalized, which in turn affects the performance of the grammar classifier. Therefore, the paper proposes a new sequencing approach that contains both the linguistic and the syntactic information of a sentence. We call this sequence a Lex-Pos sequence. The main objective of the paper is to demonstrate that the proposed Lex-Pos sequence has the potential to capture both the specific nature of the linguistic words (i.e., lexicals) and the generic structural characteristics of a sentence via Part-Of-Speech (POS) tags, and so can lead to a significant improvement in detecting grammar errors. Furthermore, the paper proposes a new vector representation technique, Word Embedding One-Hot Encoding (WEOE), to transform a Lex-Pos sequence into mathematical values. The paper also introduces a new error induction technique to artificially generate POS-tag-specific incorrect sentences for training. The classifier is trained using two corpora of incorrect sentences, one with general errors and another with POS-tag-specific errors. A Long Short-Term Memory (LSTM) neural network architecture has been employed to build the grammar classifier. The study conducts nine experiments to validate the strength of the Lex-Pos sequences. The Lex-Pos-based models are observed to be superior in two ways: (1) they give more accurate predictions; and (2) they are more stable, as smaller accuracy drops were recorded from training to testing. To further prove the potential of the proposed Lex-Pos-based model, we compare it with some well-known existing studies.
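
The following toy sketch shows one way a mixed lexical/POS sequence could be built from a POS-tagged sentence. The rule used here (keep closed-class function words as lexical items, back off to POS tags elsewhere) is an assumption for illustration, not the paper's actual Lex-Pos construction or its WEOE encoding.

```python
# Toy construction of a mixed lexical/POS sequence from a tagged sentence.
KEEP_LEXICAL = {"DT", "IN", "CC", "TO", "MD"}   # e.g. keep closed-class words

def lex_pos(tagged_sentence):
    """Keep function words as lexical items, back off to POS tags elsewhere."""
    return [word if tag in KEEP_LEXICAL else tag
            for word, tag in tagged_sentence]

tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
          ("the", "DT"), ("mat", "NN")]
print(lex_pos(tagged))   # ['the', 'NN', 'VBD', 'on', 'the', 'NN']
```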


2018 ◽  
Vol 2 (3) ◽  
pp. 247-258
Author(s):  
Zhishuo Liu ◽  
Qianhui Shen ◽  
Jingmiao Ma ◽  
Ziqi Dong

Purpose – This paper aims to extract comment targets from reviews on a Chinese online shopping platform.
Design/methodology/approach – The authors first collect the comment texts, perform word segmentation and part-of-speech (POS) tagging, and extract feature words in two passes. They then cluster the evaluation sentences and find association rules between the evaluation words and the evaluation objects, building an association rule table. Finally, the evaluation object of a comment sentence is mined from the evaluation word and the association rule table. The authors obtain comment data from Taobao and demonstrate experimentally that the proposed method is effective.
Findings – The comment target extraction method proposed in this paper is effective.
Research limitations/implications – First, the study object for extracting implicit features is the review clause, without considering context information, which may affect the accuracy of feature extraction to a certain degree. Second, when extracting feature words, low-frequency feature words are not considered, although some of them also contain useful information.
Practical implications – Because of the mass of online review data, reading every comment one by one is impossible. It is therefore important to research how to process product comments and present useful or interesting comments to clients.
Originality/value – The comment target extraction method proposed in this paper is effective.
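
As a minimal illustration of the final step, the sketch below looks up an evaluation word in an association rule table to recover the (possibly implicit) comment target. The rules and the example clause are toy values, not rules mined from Taobao reviews.

```python
# Toy sketch: recover comment targets from evaluation words via a rule table.
RULE_TABLE = {            # evaluation word -> most associated target
    "fast": "delivery",
    "cheap": "price",
    "soft": "fabric",
}

def extract_targets(clause_tokens):
    """Return (evaluation word, inferred target) pairs for a review clause."""
    return [(tok, RULE_TABLE[tok]) for tok in clause_tokens if tok in RULE_TABLE]

print(extract_targets(["really", "fast", "and", "cheap"]))
# [('fast', 'delivery'), ('cheap', 'price')]
```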


Author(s):  
Necva Bölücü ◽  
Burcu Can

Part of speech (PoS) tagging is one of the fundamental syntactic tasks in Natural Language Processing, as it assigns a syntactic category (such as noun, verb, or adjective) to each word within a given sentence or context. These syntactic categories can be used to further analyze sentence-level syntax (e.g., dependency parsing) and thereby extract the meaning of the sentence (e.g., semantic parsing). Various methods have been proposed for learning PoS tags in an unsupervised setting without using any annotated corpora. One of the widely used methods for the tagging problem is the log-linear model, and the initialization of its parameters is crucial for inference; different initialization techniques have been used so far. In this work, we present a log-linear model for PoS tagging that uses another, fully unsupervised Bayesian model to initialize its parameters in a cascaded framework. We therefore transfer knowledge between two different unsupervised models to improve the PoS tagging results, so that the log-linear model benefits from the Bayesian model's expertise. We present results for Turkish, as a morphologically rich language, and for English, as a comparatively morphologically poor language, in a fully unsupervised framework. The results show that our framework outperforms other unsupervised models proposed for PoS tagging.
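
A small sketch of the cascaded initialization idea follows: emission statistics assumed to come from a trained unsupervised Bayesian model are converted to log-probabilities and used as the initial weights of the corresponding log-linear features, instead of zero or random initialization. The tags, counts, and feature set are toy placeholders, not the models used in the paper.

```python
# Sketch of cascaded initialization: seed log-linear feature weights with
# log-probabilities derived from a (pretend) Bayesian model's counts.
import math
from collections import defaultdict

# Pretend these emission counts came from a trained unsupervised Bayesian model.
bayes_counts = {("TAG0", "the"): 50, ("TAG1", "dog"): 30, ("TAG1", "cat"): 20}
total_per_tag = defaultdict(int)
for (tag, word), c in bayes_counts.items():
    total_per_tag[tag] += c

# Initialize (tag, word) feature weights with log-probabilities rather than
# zeros or random values.
weights = {
    (tag, word): math.log(c / total_per_tag[tag])
    for (tag, word), c in bayes_counts.items()
}
print(weights)
```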


2021 ◽  
Vol 22 (1) ◽  
pp. 125-154
Author(s):  
Marieke Meelen ◽  
David Willis

This article introduces the working methods of the Parsed Historical Corpus of the Welsh Language (PARSHCWL). The corpus is designed to provide researchers with a tool for the automatic, exhaustive extraction of instances of grammatical structures from Middle and Modern Welsh texts, in a way comparable to similar tools that already exist for various European languages. The major features of the corpus are outlined, along with the overall architecture of the workflow needed for a team of researchers to produce it. In this paper, the first two stages of the process, namely pre-processing of texts and automated part-of-speech (POS) tagging, are discussed in some detail, focusing in particular on the major issues involved in defining word boundaries and in defining a robust and useful tagset.

