Automated Text Analysis

Author(s): Panagis Yannis

This chapter examines automated text analysis (ATA), an umbrella term for the methodologies that can be applied to perform text analysis with computer software. ATA is a computer-assisted method for analysing text, used whenever the analysis would be prohibitively labour-intensive because of the volume of texts involved. ATA methods have become more popular with the current interest in big data, given the volume of textual content made easily accessible by the digitization of human activity. Key to ATA is the notion of a corpus, that is, a collection of texts. A necessary step before any analysis is to gather the relevant documents and construct the corpora that will be used; which texts to include is dictated by the research question. After text collection, some processing steps are needed before the analysis starts, for example tokenization and part-of-speech tagging. Tokenization is the process of splitting a text into its constituent words, also called tokens, whereas part-of-speech tagging assigns each word a label that indicates the respective part of speech.
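
To make these two preprocessing steps concrete, here is a minimal sketch using Python and NLTK; the chapter itself does not prescribe a particular library, and the sample sentence is invented.

```python
import nltk

# One-time download of NLTK's standard tokenizer and tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "Automated text analysis scales to very large collections of documents."

# Tokenization: split the text into its constituent word tokens.
tokens = nltk.word_tokenize(text)

# Part-of-speech tagging: label each token with its part of speech.
tagged = nltk.pos_tag(tokens)
print(tagged)  # e.g. [('Automated', 'JJ'), ('text', 'NN'), ('analysis', 'NN'), ...]
```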

2017 · Vol 6 (4) · pp. 284-293
Author(s): Waldemar Karwowski

This paper discusses issues connected with indexing documents in the Polish language. Algorithms for stemming and part-of-speech tagging, which are important in text analysis and indexing, are briefly described, and their suitability for Polish, a language with very extensive inflection, is discussed. The usefulness for stemming and part-of-speech tagging of large dictionaries of inflected forms, such as WordNet and an open-source dictionary of the Polish language, is also described. Two dictionary structures that enable efficient word searching are presented. In the final part, tests of the two implemented dictionary structures are described; the tests were performed on six real texts and three artificially crafted texts. Conclusions from the tests are formulated at the end.
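
The abstract does not specify the two dictionary structures, so the sketch below only illustrates the general idea of fast lookup over a large dictionary of inflected forms: a trie that maps each inflected form to its base form. The Polish example words are illustrative, not taken from the paper.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.lemma = None    # base form stored at word-final nodes


class InflectionTrie:
    """Maps inflected word forms to their base forms via a character trie."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, inflected, lemma):
        node = self.root
        for ch in inflected:
            node = node.children.setdefault(ch, TrieNode())
        node.lemma = lemma

    def lookup(self, inflected):
        node = self.root
        for ch in inflected:
            node = node.children.get(ch)
            if node is None:
                return None  # unknown form
        return node.lemma


# "domu" and "domem" are inflected forms of "dom" (house).
trie = InflectionTrie()
trie.insert("domu", "dom")
trie.insert("domem", "dom")
print(trie.lookup("domem"))  # -> "dom"
```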


2019 · Vol 8 (2) · pp. 3899-3903

Part-of-speech tagging has long been a challenging task in natural language processing. This article presents POS tagging for Gujarati text using a Hidden Markov Model. A POS-annotated Gujarati corpus is randomly split into training and test sets, and the model achieves 80% accuracy. An error analysis of the mismatches is also discussed in detail.
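
As a rough sketch of this pipeline (not the authors' implementation or data), the code below trains a supervised HMM tagger with NLTK, uses a random 80/20 train/test split, and reports token-level accuracy; the tiny placeholder corpus stands in for the POS-annotated Gujarati data.

```python
import random
from nltk.tag import hmm

# Placeholder tagged sentences standing in for the Gujarati annotated corpus.
tagged_sents = [
    [("word_a", "PRON"), ("word_b", "NOUN"), ("word_c", "VERB")],
    [("word_d", "PRON"), ("word_e", "NOUN"), ("word_f", "VERB")],
] * 50  # repeated only so the sketch has enough material to train on

random.shuffle(tagged_sents)
split = int(0.8 * len(tagged_sents))           # 80/20 train/test split
train, test = tagged_sents[:split], tagged_sents[split:]

# Estimate the HMM's transition and emission probabilities from the tagged data.
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train)

# Accuracy: fraction of test tokens whose predicted tag matches the gold tag.
correct = total = 0
for sent in test:
    predicted = tagger.tag([w for w, _ in sent])
    correct += sum(p == g for (_, p), (_, g) in zip(predicted, sent))
    total += len(sent)
print("accuracy:", correct / total)
```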


Crisis · 2016 · Vol 37 (2) · pp. 140-147
Author(s): Michael J. Egnoto, Darrin J. Griffin

Abstract. Background: Identifying precursors that will aid in the discovery of individuals who may harm themselves or others has long been a focus of scholarly research. Aim: This work set out to determine whether the legacy tokens of active shooters and the notes left by individuals who completed suicide can be used to uncover signals that foreshadow their behavior. Method: A total of 25 suicide notes and 21 legacy tokens were compared with a sample of over 20,000 student writings in a preliminary computer-assisted text analysis to determine which differences can be coded with existing software to better identify students who may commit self-harm or harm others. Results: The results support that text analysis with the Linguistic Inquiry and Word Count (LIWC) tool can, in an automated fashion, distinguish suicidal and homicidal writings from each other and from a variety of student writings. Conclusion: The findings support the automated identification of writings associated with harm to self, harm to others, and various other student writing products. This work begins to establish the viability of larger-scale, low-cost methods for automatically detecting individuals suffering from harmful ideation.
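
LIWC itself is proprietary software with validated category dictionaries; purely to illustrate the kind of category-based word counting it performs (not the study's actual dictionaries or procedure), here is a toy sketch with made-up word lists.

```python
import re
from collections import Counter

# Illustrative placeholder categories; LIWC's real dictionaries are far larger
# and psychometrically validated.
CATEGORIES = {
    "negative_emotion": {"hate", "hurt", "sad", "angry"},
    "social": {"friend", "family", "they", "you"},
    "death": {"die", "dead", "kill", "end"},
}

def category_profile(text):
    """Return each category's share of all tokens, as a percentage."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for cat, words in CATEGORIES.items():
            if tok in words:
                counts[cat] += 1
    total = len(tokens) or 1
    return {cat: 100 * counts[cat] / total for cat in CATEGORIES}

print(category_profile("They will hurt no one after the end."))
```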


Author(s): Nindian Puspa Dewi, Ubaidi Ubaidi

POS tagging is the foundation for developing text processing for a language. In this study we examine the influence of using a lexicon, and of changes in word morphology, on determining the correct tag for a word. Rules based on word morphology, such as prefixes, suffixes, and infixes, are commonly called lexical rules. This research applies lexical rules produced by a learner using the Brill Tagger algorithm. Madurese is a regional language spoken on Madura Island and several other islands in East Java. The object of this research is Madurese, which has far more variation in affixation than Indonesian. In this study the lexicon is used not only to look up Madurese root words but also as one of the stages of POS tagging. Experiments using the lexicon reached an accuracy of 86.61%, whereas without the lexicon the accuracy was only 28.95%. It can be concluded that the lexicon has a strong influence on POS tagging.
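
As a hedged sketch of the overall approach (not the authors' Madurese lexicon or their learned affix rules), the code below trains a Brill tagger with NLTK on a tiny invented corpus: a unigram lookup stands in for the lexicon stage, and the Brill learner then derives error-correcting rules from the templates.

```python
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tag.brill import Template, Pos, Word
from nltk.tag.brill_trainer import BrillTaggerTrainer

# Illustrative placeholder sentences; the word forms are not verified Madurese.
train_sents = [
    [("oreng", "NOUN"), ("ngakan", "VERB"), ("nase", "NOUN")],
    [("kanak", "NOUN"), ("abali", "VERB"), ("ka", "ADP"), ("bengko", "NOUN")],
] * 20

# Initial tagger: unigram lookup (a stand-in for the lexicon stage),
# backing off to a default tag for unknown words.
baseline = UnigramTagger(train_sents, backoff=DefaultTagger("NOUN"))

# Rule templates; the Brill learner searches these for correction rules.
templates = [
    Template(Pos([-1])),             # condition on the previous tag
    Template(Pos([1])),              # condition on the next tag
    Template(Word([-1])),            # condition on the previous word
    Template(Word([0]), Pos([-1])),  # current word plus previous tag
]

trainer = BrillTaggerTrainer(baseline, templates, trace=0)
tagger = trainer.train(train_sents, max_rules=10)
print(tagger.tag(["kanak", "ngakan", "nase"]))
```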


2021 · Vol 184 · pp. 148-155
Author(s): Abdul Munem Nerabie, Manar AlKhatib, Sujith Samuel Mathew, May El Barachi, Farhad Oroumchian
