Drawbacks and Pitfalls of Machine-Readable Texts for Linguistic Research

1998 ◽  
Vol 3 (2) ◽  
pp. 211-228 ◽  
Author(s):  
Roberta Facchinetti

The paper highlights and discusses some practical issues related to the drawbacks and pitfalls of computerised texts, in regard to both the databases themselves and the software employed to codify and search them. In the first place, some corpora and databases are compiled in such a way that they can be searched and analysed only with tools which allow specific kinds of search. This often prevents scholars from carrying out their own free study of the data, thus hindering an effective, targeted analysis. Moreover, in some cases, the need for comprehensiveness leads to the codification and classification of subjective aspects like text difficulty and the participants' social level. This subjectivity of interpretation might mislead researchers in a socially-orientated analysis. Finally, despite being highly sophisticated, the techniques employed for automated grammatical and part-of-speech tagging, as well as for semantic and prosodic parsing, appear not to be totally reliable, since mistakes in the codification of simple items are likely to occur. Each of the above thorny issues, together with some other minor matters, is illustrated with instances drawn from the author's personal linguistic research on a variety of synchronic and diachronic corpora and databases.

Author(s):  
Divam Gupta ◽  
Tanmoy Chakraborty ◽  
Soumen Chakrabarti

In several natural language tasks, labeled sequences are available in separate domains (say, languages), but the goal is to label sequences with mixed domain (such as code-switched text). Or, we may have available models for labeling whole passages (say, with sentiments), which we would like to exploit toward better position-specific label inference (say, target-dependent sentiment annotation). A key characteristic shared across such tasks is that different positions in a primary instance can benefit from different ‘experts’ trained from auxiliary data, but labeled primary instances are scarce, and labeling the best expert for each position entails unacceptable cognitive burden. We propose GIRNet, a unified position-sensitive multi-task recurrent neural network (RNN) architecture for such applications. Auxiliary and primary tasks need not share training instances. Auxiliary RNNs are trained over auxiliary instances. A primary instance is also submitted to each auxiliary RNN, but their state sequences are gated and merged into a novel composite state sequence tailored to the primary inference task. Our approach is in sharp contrast to recent multi-task networks like the cross-stitch and sluice networks, which do not control state transfer at such fine granularity. We demonstrate the superiority of GIRNet using three applications: sentiment classification of code-switched passages, part-of-speech tagging of code-switched text, and target position-sensitive annotation of sentiment in monolingual passages. In all cases, we establish new state-of-the-art performance beyond recent competitive baselines.
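The position-wise gating and merging of auxiliary state sequences can be pictured with a minimal sketch. The abstract does not specify the gate, so the softmax over per-position scores used here is an assumption, and the names `gated_merge`, `aux_states`, and `gate_scores` are hypothetical. In a trained model the scores would be produced by a learned component rather than passed in by hand.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_merge(aux_states, gate_scores):
    """Merge per-expert state sequences into one composite sequence.

    aux_states[k][t] is the state vector of auxiliary expert k at position t.
    gate_scores[t][k] is a raw score for expert k at position t
    (assumed given here; a real model would learn to produce it).
    """
    merged = []
    for t, scores in enumerate(gate_scores):
        weights = softmax(scores)  # assumption: softmax gate per position
        dim = len(aux_states[0][t])
        composite = [0.0] * dim
        for k, weight in enumerate(weights):
            for j in range(dim):
                composite[j] += weight * aux_states[k][t][j]
        merged.append(composite)
    return merged


# Two experts, two positions, two-dimensional states (toy values).
aux_states = [
    [[1.0, 0.0], [1.0, 0.0]],  # expert 0
    [[0.0, 1.0], [0.0, 1.0]],  # expert 1
]
gate_scores = [
    [0.0, 0.0],    # position 0: both experts weighted equally
    [50.0, -50.0], # position 1: expert 0 dominates
]
composite_states = gated_merge(aux_states, gate_scores)
```

Because the weights are computed separately at each position, one position can lean on one expert while its neighbour leans on another, which is the fine-grained control the abstract contrasts with coarser state sharing.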


2014 ◽  
Vol 519-520 ◽  
pp. 784-787
Author(s):  
Zhi Qiang Wu ◽  
Hong Zhi Yu ◽  
Shu Hui Wan

Part-of-speech tagging is a basic task in Tibetan information processing; its results can be used in machine translation, speech synthesis, and other applications. Based on a study of Tibetan grammar and the classification of Tibetan parts of speech, a Tibetan part-of-speech tag set was established, a corpus was tagged, and conditional random fields (CRFs) were used to tag Tibetan parts of speech automatically. The experimental results show a part-of-speech tagging accuracy of 94.2% on the closed test set and 91.5% on the open test set.


Author(s):  
Nindian Puspa Dewi ◽  
Ubaidi Ubaidi

POS tagging is the foundation for developing text processing for a language. In this study we examine the effect of using a lexicon and of morphological changes in words on determining the correct tag for a word. Rules based on word morphology, such as prefixes, suffixes, and infixes, are commonly called lexical rules. This study applies lexical rules learned with the Brill Tagger algorithm. Madurese is a regional language spoken on Madura Island and several other islands in East Java. The object of this study is Madurese, which has far more affixation variants than Indonesian. In this study, the lexicon is used not only to look up Madurese base words but also as one stage of POS tagging. Experiments using the lexicon reached an accuracy of 86.61%, whereas without the lexicon accuracy reached only 28.95%. From this it can be concluded that the lexicon has a strong influence on POS tagging.
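A minimal sketch of lexicon lookup followed by affix-based lexical rules is given below. It is not the study's tagger: the rules are hand-written rather than learned by a Brill-style learner, only suffixes are handled, and the lexicon entries, suffix, tag names, and the function `tag_tokens` are all hypothetical placeholders, not Madurese data.

```python
def tag_tokens(tokens, lexicon, suffix_rules, default_tag="NN"):
    """Tag each token by lexicon lookup, then suffix rules, then a default.

    lexicon: dict mapping a lowercase word to its tag.
    suffix_rules: list of (suffix, tag) pairs tried in order.
    All entries passed in below are toy placeholders.
    """
    tags = []
    for token in tokens:
        word = token.lower()
        if word in lexicon:
            # Stage 1: the lexicon decides the tag directly.
            tags.append(lexicon[word])
            continue
        for suffix, tag in suffix_rules:
            # Stage 2: a lexical rule fires on the word's ending.
            if word.endswith(suffix) and len(word) > len(suffix):
                tags.append(tag)
                break
        else:
            # Stage 3: no lexicon entry and no rule matched.
            tags.append(default_tag)
    return tags


# Placeholder data, invented purely for illustration.
toy_lexicon = {"alpha": "NN", "beta": "VB"}
toy_suffix_rules = [("ly", "RB")]

tagged = tag_tokens(["Alpha", "betaly", "gamma"], toy_lexicon, toy_suffix_rules)
```

With an empty lexicon every unknown word falls through to the rules or the default tag, which gives some intuition for why accuracy can drop sharply without one.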


2021 ◽  
Vol 184 ◽  
pp. 148-155
Author(s):  
Abdul Munem Nerabie ◽  
Manar AlKhatib ◽  
Sujith Samuel Mathew ◽  
May El Barachi ◽  
Farhad Oroumchian
