Finite state segmentation of discourse into clauses

1996 ◽  
Vol 2 (4) ◽  
pp. 355-364 ◽  
Author(s):  
EVA EJERHED

The paper presents background and motivation for a processing model that segments discourse into simple, non-nested clauses prior to the recognition of clause-internal phrasal constituents, together with experimental results in support of this model. One set of results is derived from a statistical reanalysis of the Swedish empirical data in Strangert, Ejerhed and Huber (1993) concerning the linguistic structure of major prosodic units. The other set is derived from experiments in segmenting part-of-speech annotated Swedish text corpora into clauses using a new clause segmentation algorithm. The clause-segmented corpus data is taken from the Stockholm Umeå Corpus (SUC), 1 M words of Swedish texts from different genres annotated for part of speech by hand, and from the Umeå corpus DAGENS INDUSTRI 1993 (DI93), 5 M words of Swedish financial newspaper text processed fully automatically by tokenizing, lexical analysis and probabilistic POS tagging. The results of these two experiments show that the proposed clause segmentation algorithm is 96% correct when applied to manually tagged text and 91% correct when applied to probabilistically tagged text.
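The paper's actual rules are not reproduced in the abstract; as a minimal illustrative sketch (not Ejerhed's algorithm), a finite-state segmenter can scan a POS-tagged token stream and open a new clause at subordinator tags while closing one at major punctuation. The tag names below are invented placeholders:

```python
# Minimal sketch of finite-state clause segmentation over a POS-tagged
# token stream. The boundary tags and rules are illustrative only.

BOUNDARY_TAGS = {"SUB_CONJ", "REL_PRON"}   # tags that open a new clause
TERMINATORS = {"PUNCT_MAJOR"}              # tags that close the current clause

def segment_clauses(tagged_tokens):
    """Split a list of (word, tag) pairs into simple, non-nested clauses."""
    clauses, current = [], []
    for word, tag in tagged_tokens:
        if tag in BOUNDARY_TAGS and current:
            clauses.append(current)        # a boundary tag starts a new clause
            current = []
        current.append((word, tag))
        if tag in TERMINATORS:
            clauses.append(current)        # major punctuation ends the clause
            current = []
    if current:
        clauses.append(current)
    return clauses

sent = [("she", "PRON"), ("left", "VERB"), ("because", "SUB_CONJ"),
        ("it", "PRON"), ("rained", "VERB"), (".", "PUNCT_MAJOR")]
print(segment_clauses(sent))
```

Because the scan is strictly left to right with no stack, the output clauses are flat and non-nested, matching the model's design.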

2017 ◽  
pp. 35-46 ◽  
Author(s):  
Irene Doval

This paper reviews the author’s experiences of tokenizing and POS tagging a bilingual parallel corpus, the PaGeS Corpus, consisting mostly of German and Spanish fictional texts. This is part of an ongoing process of annotating the corpus for part-of-speech information. This study discusses the specific problems encountered so far. On the one hand, tagging performance degrades significantly when applied to fictional data and, on the other, pre-existing annotation schemes are all language specific. To further improve accuracy during post-editing, the author has developed a common tagset and identified major error patterns.
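Since pre-existing annotation schemes are language specific, a common tagset amounts to a mapping table per language. A minimal sketch, with invented German- and Spanish-style tags standing in for the PaGeS scheme:

```python
# Illustrative sketch of mapping language-specific tags onto a shared
# common tagset; the tag names below are invented placeholders, not the
# actual PaGeS annotation scheme.

COMMON_TAGSET = {
    "de": {"NN": "NOUN", "VVFIN": "VERB", "ADJA": "ADJ"},
    "es": {"NC": "NOUN", "VMI": "VERB", "AQ": "ADJ"},
}

def to_common(lang, tag):
    # Fall back to "X" for tags with no agreed common category
    return COMMON_TAGSET[lang].get(tag, "X")

pairs = [("de", "NN"), ("es", "NC"), ("de", "VVFIN"), ("es", "ZZZ")]
print([to_common(lang, tag) for lang, tag in pairs])
```

A table like this also makes error patterns visible during post-editing: any tag repeatedly falling through to "X" signals a gap in the scheme.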


2019 ◽  
Author(s):  
Timofey Arkhangelskiy

Lexicography and corpus studies of grammar have a long history of fruitful interaction. For the most part, however, this has been a one-way relationship. Lexicographers have extensively used corpora to identify previously undetected word senses or find natural usage examples; using lexicographic materials when conducting data-driven investigations of grammar, on the other hand, is hardly commonplace. In this paper, I present a Beserman Udmurt corpus made out of "artificial" dictionary examples. I argue that, although such a corpus cannot be used for certain kinds of corpus-based research, it is nevertheless a very useful tool for writing a reference grammar of a language. This is particularly important for underresourced endangered varieties such as Beserman, where available corpus data is scarce. The paper describes the process of developing the Beserman usage example corpus, explores its differences compared to traditional text corpora, and discusses how those can be beneficial for grammar research.


Rhema ◽  
2018 ◽  
pp. 34-49
Author(s):  
V. Vydrin

Separable adjectives represent a morphosyntactic subcategory of the part of speech of adjectives in Bambara (< Manding < Mande < Niger-Congo, Mali, West Africa). A separable adjective is a compound lexeme consisting of a noun root designating most often a body part, a qualitative verb root and a connector -la- ~ -lan- or -ma- ~ -man-. When used predicatively, the final component of a separable adjective (the qualitative verb root) is split from the rest of the form by the auxiliary word ka or man. Separable adjectives express mainly human qualities (moral or physical), and their semantics are very often idiomatic. The productivity of this subclass is limited. In order to establish an inventory of the separable adjectives, two approaches have been followed: elicitation and a search in the Bambara Reference Corpus (which included roughly 4,110,000 words at the time of this study). The potentially imaginable number of lexemes of this type equals 570 (15 noun roots × 19 qualitative verb roots × 2 connectors). Elicitation provided 75 separable adjectives, and the corpus study yielded 25, 3 of which are absent from the elicited list. This experiment proves that in studies of derivational morphology, when a linguist needs to fill out a matrix, elicitation cannot simply be replaced by a corpus study. On the other hand, the corpus data provides invaluable supplementary data that cannot be obtained through elicitation.
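The combinatorial "matrix" the paper fills out can be sketched directly: 15 noun roots × 19 qualitative verb roots × 2 connectors gives 570 candidate slots, and comparing the two inventories as sets mirrors the finding that the corpus adds items absent from elicitation. The root labels below are placeholders, not the actual Bambara roots:

```python
from itertools import product

# Placeholder roots standing in for the 15 noun roots, 19 qualitative
# verb roots and 2 connectors of Bambara separable adjectives.
noun_roots = [f"N{i}" for i in range(15)]
verb_roots = [f"Q{i}" for i in range(19)]
connectors = ["la", "ma"]

slots = list(product(noun_roots, connectors, verb_roots))
print(len(slots))  # 570 potential lexemes

# Invented stand-ins for the attested inventories: 75 elicited items,
# 25 corpus items of which 3 are absent from the elicited list.
elicited = set(slots[:75])
corpus = set(slots[50:72]) | set(slots[100:103])
print(len(corpus - elicited))
```

The set difference is exactly the paper's point: the corpus contributes forms elicitation missed, while covering far fewer slots overall.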


2021 ◽  
Vol 72 (2) ◽  
pp. 590-602
Author(s):  
Kirill I. Semenov ◽  
Armine K. Titizian ◽  
Aleksandra O. Piskunova ◽  
Yulia O. Korotkova ◽  
Alena D. Tsvetkova ◽  
...  

Abstract The article tackles the problems of linguistic annotation of the Chinese texts in Ruzhcorp, the Russian-Chinese Parallel Corpus of the RNC, and ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present a theoretical comparison of the widespread standards of Chinese text processing. On the other hand, we describe our experiments in three areas: word segmentation, grapheme-to-phoneme conversion and PoS tagging, on corpus data that contains many transliterations and loanwords. As a result, we propose a preprocessing pipeline for the Chinese texts that will be implemented in Ruzhcorp.
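Ruzhcorp's actual pipeline is not given in the abstract; as a toy sketch of one stage, Chinese word segmentation can be done by dictionary-based maximum matching, which also shows why transliterated loanwords are hard: a loanword missing from the lexicon falls apart into single characters. The lexicon below is invented:

```python
# Toy sketch of greedy longest-match (maximum matching) Chinese word
# segmentation. The lexicon is invented; 莫斯科 ("Moscow") stands in for
# a transliterated Russian name that must be listed to segment correctly.
LEXICON = {"莫斯科", "是", "俄罗斯", "的", "首都"}
MAX_LEN = max(len(w) for w in LEXICON)

def max_match(text):
    """Greedy left-to-right longest-match segmentation."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            # Take the longest lexicon entry; fall back to one character
            if text[i:j] in LEXICON or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(max_match("莫斯科是俄罗斯的首都"))  # "Moscow is Russia's capital"
```

Removing 莫斯科 from the lexicon would scatter it into three one-character tokens, which is exactly the loanword failure mode the corpus data exhibits.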


Author(s):  
Nindian Puspa Dewi ◽  
Ubaidi Ubaidi

POS tagging is the foundation for developing text processing for a language. In this study we examine the influence of using a lexicon and of morphological changes in words on determining the correct tag for a word. Rules based on word morphology, such as prefixes, suffixes and infixes, are usually called lexical rules. This study applies lexical rules produced by a learner using the Brill Tagger algorithm. Madurese is a regional language spoken on Madura Island and several other islands in East Java. The object of this research is Madurese, which has far more affixation variants than Indonesian. In this study, the lexicon is used not only to look up Madurese root words but also as one stage of POS tagging. Experiments using the lexicon reached an accuracy of 86.61%, whereas without the lexicon accuracy reached only 28.95%. From this it can be concluded that the lexicon strongly influences POS tagging.
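The interplay of lexicon lookup and Brill-style lexical rules can be sketched as follows; the Madurese words, affixes and rules below are invented placeholders, not the rules the learner actually produced:

```python
# Sketch of lexicon-first tagging with Brill-style lexical rules as a
# fallback for unknown words. Words, affixes and tags are invented
# placeholders, not the paper's learned rules.
LEXICON = {"ngakan": "VERB", "roma": "NOUN"}

LEXICAL_RULES = [
    ("suffix", "an", "NOUN"),   # if the word ends in -an, tag NOUN
    ("prefix", "a", "VERB"),    # if the word starts with a-, tag VERB
]

def tag(word):
    if word in LEXICON:                    # stage 1: lexicon lookup
        return LEXICON[word]
    for kind, affix, t in LEXICAL_RULES:   # stage 2: lexical rules
        if kind == "suffix" and word.endswith(affix):
            return t
        if kind == "prefix" and word.startswith(affix):
            return t
    return "X"                             # default for uncovered words

print([tag(w) for w in ["ngakan", "kakanan", "abajang"]])
```

The accuracy gap in the paper (86.61% with the lexicon vs 28.95% without) corresponds to how much of stage 1 is available: without it, every word must be guessed from affixes alone.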


Electronics ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 1372
Author(s):  
Sanjanasri JP ◽  
Vijay Krishna Menon ◽  
Soman KP ◽  
Rajendran S ◽  
Agnieszka Wolk

Linguists have long focused on qualitative comparison of the semantics of different languages. Evaluating semantic interpretation between disparate language pairs such as English and Tamil is an even more formidable task than between Slavic languages. The concept of word embedding in Natural Language Processing (NLP) has enabled a felicitous opportunity to quantify linguistic semantics. Multi-lingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised for the generation of embeddings to assess their effectiveness, using the original embeddings as ground truths. Transferability of the proposed model to other target languages was assessed via pre-trained Word2Vec embeddings for Hindi and Chinese. We empirically prove that with a bilingual dictionary of a thousand words and a correspondingly small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of the generated target embeddings in a few NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that those are not the only possible applications.
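The simplest instance of such a transfer function is a linear map fitted by least squares over dictionary pairs, in the spirit of Mikolov-style bilingual mapping (the paper's deep models are more elaborate). A self-contained sketch with invented 2-D toy vectors standing in for English and Tamil embeddings:

```python
# Least-squares linear map between two toy embedding spaces, solved via
# the normal equations W = (X^T X)^-1 X^T Y. All vectors are invented
# 2-D stand-ins for real English/Tamil embeddings.

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inv2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# "English" vectors X and their "Tamil" translations Y (dictionary pairs)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[2.0, 0.0], [0.0, 3.0], [2.0, 3.0]]

Xt = transpose(X)
W = matmul(inv2(matmul(Xt, X)), matmul(Xt, Y))
print(matmul([[1.0, 0.0]], W))  # projects the first source vector
```

With only a thousand dictionary pairs, exactly this kind of fit is what makes the transfer data-efficient: the map is estimated from pairs, then applied to the whole source vocabulary.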


2018 ◽  
Vol 2 (3) ◽  
pp. 247-258
Author(s):  
Zhishuo Liu ◽  
Qianhui Shen ◽  
Jingmiao Ma ◽  
Ziqi Dong

Purpose
This paper aims to extract comment targets on a Chinese online shopping platform.

Design/methodology/approach
The authors first collect comment texts and perform word segmentation, part-of-speech (POS) tagging and two rounds of feature-word extraction. They then cluster the evaluation sentences and find association rules between evaluation words and evaluation objects, building an association rule table. Finally, the evaluation object of a comment sentence can be mined from its evaluation word and the association rule table. The authors obtain comment data from Taobao and demonstrate experimentally that the proposed method is effective.

Findings
The proposed method for extracting comment targets is effective.

Research limitations/implications
First, implicit features are extracted from review clauses without considering context information, which may affect the accuracy of feature mining to a certain degree. Second, low-frequency feature words are not considered during feature extraction, although some of them also contain useful information.

Practical implications
Given the mass of online review data, reading every comment one by one is impossible, so research on processing product comments and presenting useful or interesting comments to clients is important.

Originality/value
The proposed method for extracting comment targets is effective.
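The core mechanism, an association rule table linking evaluation words to their most frequent targets and used to resolve clauses whose target is implicit, can be sketched with co-occurrence counts. The review clauses below are invented:

```python
from collections import Counter, defaultdict

# Sketch of building an association-rule table between evaluation words
# and comment targets from co-occurrence counts; the clauses are invented
# (target, evaluation word) pairs, with None marking an implicit target.
clauses = [
    ("screen", "clear"), ("screen", "clear"), ("battery", "weak"),
    (None, "clear"),     # implicit target: only the evaluation word appears
]

cooc = defaultdict(Counter)
for target, opinion in clauses:
    if target is not None:
        cooc[opinion][target] += 1

# Rule table: each evaluation word points to its most associated target
rules = {op: counts.most_common(1)[0][0] for op, counts in cooc.items()}

# Resolve the implicit targets via the rule table
implicit = [op for t, op in clauses if t is None]
print([(op, rules.get(op)) for op in implicit])
```

This also illustrates the stated limitation: a low-frequency evaluation word never enters the rule table, so its implicit targets cannot be resolved.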


2020 ◽  
Vol 73 (1) ◽  
pp. 113-158
Author(s):  
Timur Maisak

Abstract This paper gives an account of participial clauses in Agul (Lezgic, Nakh-Daghestanian), based on a sample of 858 headed noun-modifying clauses taken from two text corpora, one spoken and one written. Noun-modifying clauses in Agul do not show syntactic restrictions on what can be relativized, and hence they instantiate the type known as GNMCCs, or general noun-modifying clause constructions. As the text counts show, intransitive verbs are more frequent than transitives and experiencer verbs in participial clauses, and among intransitive verbs, locative statives with the roots ‘be’ and ‘stay, remain’ account for half of all the uses. The asymmetry between the different relativization targets is also significant. Among the core arguments, the intransitive subject (S) is the most frequent target, patient (P) occupies second place, and agent (A) is comparatively rare. The preference of S and, in general, of S and P over A also holds true for most other Nakh-Daghestanian languages for which comparable counts are available. At the same time, Agul stands apart from the other languages by its high ratio of non-core relativization which accounts for 42% of all participial clauses. Addressee, arguments and adjuncts encoded with a locative case, as well as more general place and time relativizations show especially high frequency, outnumbering such arguments as experiencers, recipients, and predicative and adnominal possessors. Possible reasons for the high ratio of non-argument relativization are discussed in the paper.


Author(s):  
Necva Bölücü ◽  
Burcu Can

Part of speech (PoS) tagging is one of the fundamental syntactic tasks in Natural Language Processing, as it assigns a syntactic category to each word within a given sentence or context (such as noun, verb, adjective, etc.). Those syntactic categories could be used to further analyze the sentence-level syntax (e.g., dependency parsing) and thereby extract the meaning of the sentence (e.g., semantic parsing). Various methods have been proposed for learning PoS tags in an unsupervised setting without using any annotated corpora. One of the widely used methods for the tagging problem is log-linear models. Initialization of the parameters in a log-linear model is very crucial for the inference. Different initialization techniques have been used so far. In this work, we present a log-linear model for PoS tagging that uses another fully unsupervised Bayesian model to initialize the parameters of the model in a cascaded framework. Therefore, we transfer some knowledge between two different unsupervised models to leverage the PoS tagging results, where a log-linear model benefits from a Bayesian model’s expertise. We present results for Turkish as a morphologically rich language and for English as a comparably morphologically poor language in a fully unsupervised framework. The results show that our framework outperforms other unsupervised models proposed for PoS tagging.
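The cascaded initialization idea, letting a simpler probabilistic model supply the log-linear model's starting weights, can be illustrated with a minimal sketch: a Dirichlet-smoothed count model (standing in for the Bayesian model; the actual models in the paper are richer) provides log emission probabilities as initial word-tag feature weights. The data and tags are invented:

```python
import math
from collections import Counter

# Sketch of cascaded initialization: smoothed emission probabilities from
# a Dirichlet-count model become initial weights for a log-linear tagger's
# word-tag features. The tagged data is an invented toy sample.
tagged = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"), ("the", "DET")]
alpha = 0.5  # symmetric Dirichlet prior
vocab = {w for w, _ in tagged}

counts = Counter(tagged)
tag_totals = Counter(t for _, t in tagged)

def init_weight(word, tag):
    """Log of the smoothed emission probability P(word | tag)."""
    p = (counts[(word, tag)] + alpha) / (tag_totals[tag] + alpha * len(vocab))
    return math.log(p)

weights = {(w, t): init_weight(w, t) for w, t in counts}
print(max(weights, key=weights.get))  # strongest initial word-tag feature
```

Starting the log-linear inference from these weights, rather than from a uniform or random point, is what transfers knowledge between the two unsupervised models.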


2021 ◽  
Vol 178 (1-2) ◽  
pp. 59-76
Author(s):  
Emmanuel Filiot ◽  
Pierre-Alain Reynier

Copyless streaming string transducers (copyless SST) were introduced by R. Alur and P. Černý in 2010 as a one-way deterministic automaton model for defining transductions of finite strings. Copyless SST extend deterministic finite-state automata with a set of variables in which intermediate output strings are stored; these variables can be combined and updated all along the run in a linear manner, i.e., no variable content can be copied on transitions. It is known that copyless SST capture exactly the class of MSO-definable string-to-string transductions and are as expressive as deterministic two-way transducers. They enjoy good algorithmic properties; most notably, their equivalence problem is decidable (in PSpace). HDT0L systems, on the other hand, were introduced much earlier, the most prominent result being the decidability of their equivalence problem. In this paper, we propose a semantics of HDT0L systems in terms of transductions, and use it to study the class of deterministic copyful SST. Our contributions are as follows: (i) HDT0L systems and total deterministic copyful SST have the same expressive power; (ii) the equivalence problems for deterministic copyful SST and for HDT0L systems are inter-reducible, in quadratic time; as a consequence, equivalence of deterministic SST is decidable; (iii) functionality of non-deterministic copyful SST is decidable; (iv) whether a non-deterministic copyful SST can be transformed into an equivalent non-deterministic copyless SST is decidable in polynomial time.
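A textbook example of a copyless SST is string reversal with a single register: on each input letter a, the update is x := a·x, which uses the register exactly once per transition. A minimal simulation sketch (one state, one register; not the paper's constructions):

```python
# Sketch of a one-state copyless SST with a single register x computing
# string reversal: on reading letter a, the update x := a + x uses x
# exactly once on the right-hand side, so no content is ever copied.

def run_sst(word):
    x = ""                 # register content, initially empty
    for a in word:
        x = a + x          # copyless update: x appears once on the right
    return x               # the final output function emits x

print(run_sst("abc"))
```

A copyful update such as x := x + x would duplicate register content and can produce exponentially long outputs, which is precisely what separates the copyful class studied in the paper from the copyless one.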

