Automated Text Analysis

Author(s): Panagis Yannis

This chapter examines automated text analysis (ATA), an umbrella term for the methodologies that can be applied to perform text analysis with computer software. ATA is a computer-assisted method for analysing text, used whenever the analysis would be prohibitively labour-intensive because of the volume of texts involved. ATA methods have become more popular with the current interest in big data, given the volume of textual content made easily accessible by the digitization of human activity. Key to ATA is the notion of a corpus, that is, a collection of texts. A necessary step before any analysis is to gather the relevant documents and construct the corpora that will be used; which texts to include is dictated by the research question. After text collection, some processing steps are needed before the analysis starts, for example tokenization and part-of-speech tagging. Tokenization is the process of splitting a text into its constituent words, also called tokens, whereas part-of-speech tagging assigns each word a label that indicates the respective part of speech.
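
To make these two preprocessing steps concrete, here is a minimal sketch using Python and NLTK; the chapter itself does not prescribe a particular library, and the sample sentence is invented.

```python
import nltk

# One-time download of NLTK's standard tokenizer and tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "Automated text analysis scales to very large collections of documents."

# Tokenization: split the text into its constituent word tokens.
tokens = nltk.word_tokenize(text)

# Part-of-speech tagging: label each token with its part of speech.
tagged = nltk.pos_tag(tokens)
print(tagged)  # e.g. [('Automated', 'JJ'), ('text', 'NN'), ('analysis', 'NN'), ...]
```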

2017 · Vol 6 (4) · pp. 284-293
Author(s): Waldemar Karwowski

This paper discusses issues connected with indexing documents in the Polish language. Algorithms for stemming and part-of-speech tagging, which are important in text analysis and indexing, are briefly described, and their suitability for Polish, a language with very extensive inflection, is discussed. The usefulness for stemming and part-of-speech tagging of large dictionaries of inflected forms, such as WordNet and an open-source dictionary of the Polish language, is also described. Two dictionary structures that enable efficient word searching are presented. In the final part, tests of the two implemented dictionary structures are described; the tests were performed on six real texts and three artificially crafted texts. Conclusions from the tests are formulated at the end.
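
The abstract does not specify the two dictionary structures, so the sketch below only illustrates the general idea of fast lookup over a large dictionary of inflected forms: a trie that maps each inflected form to its base form. The Polish example words are illustrative, not taken from the paper.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.lemma = None    # base form stored at word-final nodes


class InflectionTrie:
    """Maps inflected word forms to their base forms via a character trie."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, inflected, lemma):
        node = self.root
        for ch in inflected:
            node = node.children.setdefault(ch, TrieNode())
        node.lemma = lemma

    def lookup(self, inflected):
        node = self.root
        for ch in inflected:
            node = node.children.get(ch)
            if node is None:
                return None  # unknown form
        return node.lemma


# "domu" and "domem" are inflected forms of "dom" (house).
trie = InflectionTrie()
trie.insert("domu", "dom")
trie.insert("domem", "dom")
print(trie.lookup("domem"))  # -> "dom"
```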


2019 · Vol 8 (2) · pp. 3899-3903

Part-of-speech tagging has long been a challenging task in natural language processing. This article presents POS tagging for Gujarati text using a Hidden Markov Model. A POS-annotated Gujarati corpus is randomly split into training and test sets, and the model achieves 80% accuracy. An error analysis of the mismatches is also discussed in detail.
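
As a rough sketch of this pipeline (not the authors' implementation or data), the code below trains a supervised HMM tagger with NLTK, uses a random 80/20 train/test split, and reports token-level accuracy; the tiny placeholder corpus stands in for the POS-annotated Gujarati data.

```python
import random
from nltk.tag import hmm

# Placeholder tagged sentences standing in for the Gujarati annotated corpus.
tagged_sents = [
    [("word_a", "PRON"), ("word_b", "NOUN"), ("word_c", "VERB")],
    [("word_d", "PRON"), ("word_e", "NOUN"), ("word_f", "VERB")],
] * 50  # repeated only so the sketch has enough material to train on

random.shuffle(tagged_sents)
split = int(0.8 * len(tagged_sents))           # 80/20 train/test split
train, test = tagged_sents[:split], tagged_sents[split:]

# Estimate the HMM's transition and emission probabilities from the tagged data.
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train)

# Accuracy: fraction of test tokens whose predicted tag matches the gold tag.
correct = total = 0
for sent in test:
    predicted = tagger.tag([w for w, _ in sent])
    correct += sum(p == g for (_, p), (_, g) in zip(predicted, sent))
    total += len(sent)
print("accuracy:", correct / total)
```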


Crisis · 2016 · Vol 37 (2) · pp. 140-147
Author(s): Michael J. Egnoto, Darrin J. Griffin

Abstract. Background: Identifying precursors that will aid in the discovery of individuals who may harm themselves or others has long been a focus of scholarly research. Aim: This work set out to determine whether the legacy tokens of active shooters and the notes left by individuals who completed suicide can be used to uncover signals that foreshadow their behavior. Method: A total of 25 suicide notes and 21 legacy tokens were compared with a sample of over 20,000 student writings in a preliminary computer-assisted text analysis to determine which differences can be coded with existing software to better identify students who may commit self-harm or harm others. Results: The results support that text analysis with the Linguistic Inquiry and Word Count (LIWC) tool can, in an automated fashion, distinguish suicidal and homicidal writings from each other and from a variety of student writings. Conclusion: The findings support the automated identification of writings associated with harm to self, harm to others, and various other student writing products. This work begins to establish the viability of larger-scale, low-cost methods for automatically detecting individuals suffering from harmful ideation.
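
LIWC itself is proprietary software with validated category dictionaries; purely to illustrate the kind of category-based word counting it performs (not the study's actual dictionaries or procedure), here is a toy sketch with made-up word lists.

```python
import re
from collections import Counter

# Illustrative placeholder categories; LIWC's real dictionaries are far larger
# and psychometrically validated.
CATEGORIES = {
    "negative_emotion": {"hate", "hurt", "sad", "angry"},
    "social": {"friend", "family", "they", "you"},
    "death": {"die", "dead", "kill", "end"},
}

def category_profile(text):
    """Return each category's share of all tokens, as a percentage."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for cat, words in CATEGORIES.items():
            if tok in words:
                counts[cat] += 1
    total = len(tokens) or 1
    return {cat: 100 * counts[cat] / total for cat in CATEGORIES}

print(category_profile("They will hurt no one after the end."))
```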


Author(s): Nindian Puspa Dewi, Ubaidi Ubaidi

POS tagging is the foundation for developing text processing for a language. In this study we examine the influence of using a lexicon, and of changes in word morphology, on determining the correct tag for a word. Rules based on word morphology, such as prefixes, suffixes, and infixes, are commonly called lexical rules. This research applies lexical rules produced by a learner using the Brill Tagger algorithm. Madurese is a regional language spoken on Madura Island and several other islands in East Java. The object of this research is Madurese, which has far more variation in affixation than Indonesian. In this study the lexicon is used not only to look up Madurese root words but also as one of the stages of POS tagging. Experiments using the lexicon reached an accuracy of 86.61%, whereas without the lexicon the accuracy was only 28.95%. It can be concluded that the lexicon has a strong influence on POS tagging.
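
As a hedged sketch of the overall approach (not the authors' Madurese lexicon or their learned affix rules), the code below trains a Brill tagger with NLTK on a tiny invented corpus: a unigram lookup stands in for the lexicon stage, and the Brill learner then derives error-correcting rules from the templates.

```python
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tag.brill import Template, Pos, Word
from nltk.tag.brill_trainer import BrillTaggerTrainer

# Illustrative placeholder sentences; the word forms are not verified Madurese.
train_sents = [
    [("oreng", "NOUN"), ("ngakan", "VERB"), ("nase", "NOUN")],
    [("kanak", "NOUN"), ("abali", "VERB"), ("ka", "ADP"), ("bengko", "NOUN")],
] * 20

# Initial tagger: unigram lookup (a stand-in for the lexicon stage),
# backing off to a default tag for unknown words.
baseline = UnigramTagger(train_sents, backoff=DefaultTagger("NOUN"))

# Rule templates; the Brill learner searches these for correction rules.
templates = [
    Template(Pos([-1])),             # condition on the previous tag
    Template(Pos([1])),              # condition on the next tag
    Template(Word([-1])),            # condition on the previous word
    Template(Word([0]), Pos([-1])),  # current word plus previous tag
]

trainer = BrillTaggerTrainer(baseline, templates, trace=0)
tagger = trainer.train(train_sents, max_rules=10)
print(tagger.tag(["kanak", "ngakan", "nase"]))
```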


2021 · Vol 184 · pp. 148-155
Author(s): Abdul Munem Nerabie, Manar AlKhatib, Sujith Samuel Mathew, May El Barachi, Farhad Oroumchian
