Learning Preposition Priors to Generate Scene from Text Using Contact Constraints

Author(s):  
S. Yashaswini ◽  
S. S. Shylaja

In this paper, we propose a method for generating a 3D scene from text in the interior-design domain, taking into account the orientation of every object in the scene. Thousands of interior-design sentences are generated using an RNN so as to preserve context between sentences. A BiLSTM-RNN-WE method is used for POS tagging, and Blender is used to generate the 3D scene from the query. The paper focuses on interior design and places objects according to the prepositions in the sentence. Our approach uses natural language processing to extract useful information from the user's text, which helps the rendering engine generate a better scene.
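As a rough illustration of preposition-driven object placement (not the authors' actual pipeline), a lookup from prepositions to relative 3D offsets might look like the sketch below. The offset table and the simple `<target> <prep> ... <anchor>` pattern matcher are invented for this example:

```python
# Minimal sketch: map prepositions in a furniture description to relative
# 3D placement offsets, roughly in the spirit of preposition-based layout.
PREPOSITION_OFFSETS = {
    "on": (0.0, 0.0, 1.0),      # target rests on top of the anchor
    "under": (0.0, 0.0, -1.0),
    "beside": (1.0, 0.0, 0.0),
    "behind": (0.0, -1.0, 0.0),
}

def placement_constraint(sentence):
    """Return (target, preposition, anchor, offset) from a simple
    '<target> <prep> the <anchor>' pattern, or None if no match."""
    tokens = sentence.lower().replace(".", "").split()
    for i, tok in enumerate(tokens):
        if tok in PREPOSITION_OFFSETS and 0 < i < len(tokens) - 1:
            target = tokens[i - 1]          # noun just before the preposition
            anchor = tokens[-1]             # last noun in the sentence
            return target, tok, anchor, PREPOSITION_OFFSETS[tok]
    return None

print(placement_constraint("Place the lamp on the table"))
# → ('lamp', 'on', 'table', (0.0, 0.0, 1.0))
```

A real system would use POS tags rather than raw positions to find the nouns, but the constraint it hands to the renderer has the same shape.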

2015 ◽  
Author(s):  
Abraham G Ayana

Natural Language Processing (NLP) refers to human-like language processing, which places it within the field of Artificial Intelligence (AI). The ultimate goal of NLP research is to parse and understand language, which has not yet been fully achieved. For this reason, much research in NLP has focused on intermediate tasks that make sense of some of the structure inherent in language without requiring complete understanding. One such task is part-of-speech tagging, or simply tagging. The lack of a standard part-of-speech tagger for Afaan Oromo is a major obstacle for researchers in machine translation, spell checking, dictionary compilation, and automatic sentence parsing and construction. Even though several works on POS tagging for Afaan Oromo exist, the performance of the tagger has not yet been sufficiently improved. Hence, the aim of this thesis is to improve the lexical and transformation rules of Brill's tagger for Afaan Oromo POS tagging, using a sufficiently large training corpus. Accordingly, the Afaan Oromo literature on grammar and morphology was reviewed to understand the nature of the language and to identify possible tagsets. As a result, 26 broad tagsets were identified, and 17,473 words from around 1,100 sentences containing 6,750 distinct words were tagged for training and testing purposes; of these, 258 sentences were taken from previous work. Since only a few ready-made standard corpora exist, the manual tagging needed to prepare the corpus for this work was challenging, and it is therefore recommended that a standard corpus be prepared. Transformation-based error-driven learning was adapted for Afaan Oromo part-of-speech tagging. Different experiments were conducted for the rule-based approach, taking 20% of the data for testing, and a comparison with the previously adapted Brill's tagger was made.
The previously adapted Brill's tagger shows an accuracy of 80.08%, whereas the improved Brill's tagger achieves 95.6%, an improvement of 15.52 percentage points. Hence, it is found that the size of the training corpus, the rule-generating system in the lexical-rule learner, and the use of an Afaan Oromo HMM tagger as the initial-state tagger all have a significant effect on the improvement of the tagger.
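Transformation-based (Brill-style) learning can be sketched in miniature: start from a most-frequent-tag baseline, then greedily learn contextual rules that fix the most remaining errors. The toy corpus and tagset below are invented for illustration and are unrelated to the Afaan Oromo data:

```python
# Toy sketch of transformation-based (Brill-style) learning: a unigram
# baseline tagger plus one round of error-driven contextual rule learning.
from collections import Counter, defaultdict

train = [[("the", "DET"), ("can", "NOUN"), ("rusted", "VERB")],
         [("we", "PRON"), ("can", "AUX"), ("see", "VERB"), ("the", "DET"), ("sea", "NOUN")],
         [("they", "PRON"), ("can", "AUX"), ("go", "VERB")]]

# 1. Unigram baseline: most frequent tag per word (unknown words -> NOUN).
counts = defaultdict(Counter)
for sent in train:
    for word, tg in sent:
        counts[word][tg] += 1
baseline = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, rules=()):
    tags = [baseline.get(w, "NOUN") for w in words]
    for frm, to, prev in rules:          # rule: change frm->to when previous tag is prev
        for i in range(1, len(tags)):
            if tags[i] == frm and tags[i - 1] == prev:
                tags[i] = to
    return tags

# 2. Learn the single contextual rule that fixes the most baseline errors.
gain = Counter()
for sent in train:
    words, gold = zip(*sent)
    guess = tag(words)
    for i in range(1, len(gold)):
        if guess[i] != gold[i]:
            gain[(guess[i], gold[i], guess[i - 1])] += 1
best_rule = gain.most_common(1)[0][0]
print(best_rule)   # the learned (from_tag, to_tag, previous_tag) triple
```

Brill's actual learner iterates this error-driven step over a much richer rule template set (previous word, next tag, word suffixes, and so on), but the greedy fix-the-most-errors loop is the same.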


2015 ◽  
Vol 77 (18) ◽  
Author(s):  
Wai-Howe Khong ◽  
Lay-Ki Soon ◽  
Hui-Ngo Goh

Sentiment analysis has emerged as one of the most powerful tools in business intelligence. With the aim of proposing an effective sentiment analysis technique, we performed experiments analyzing the sentiments of 3,424 tweets using both statistical and natural language processing (NLP) techniques as part of our background study. For the statistical technique, machine learning algorithms such as Support Vector Machines (SVMs), decision trees, and Naïve Bayes were explored. The results show that SVM consistently outperformed the rest in both classifications. For sentiment analysis using NLP techniques, we used two different tagging methods for part-of-speech (POS) tagging. The output was then used for word sense disambiguation (WSD) using WordNet, followed by sentiment identification using SentiWordNet. Our experimental results indicate that adjectives and adverbs are sufficient to infer the sentiment of tweets compared to other combinations. Comparatively, the statistical approach records approximately 17% higher accuracy than the NLP approach.
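To make the statistical side of the comparison concrete, here is a tiny Naive Bayes sentiment classifier in pure Python (the paper's strongest model was SVM; Naive Bayes is shown because it fits in a few lines). The four training snippets are invented:

```python
# Tiny Naive Bayes sentiment sketch with add-one smoothing, illustrating
# the statistical approach; the actual study used 3,424 labeled tweets.
import math
from collections import Counter

train = [("good great film", "pos"), ("great plot", "pos"),
         ("bad boring film", "neg"), ("boring waste", "neg")]

word_counts = {"pos": Counter(), "neg": Counter()}
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        # log prior + add-one smoothed log likelihood of each token
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("great film"))   # → pos
```

An SVM replaces the per-word probability scores with a learned max-margin hyperplane over the same bag-of-words features, which is why it tends to win on this kind of data.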


2020 ◽  
Vol 7 (6) ◽  
pp. 1121
Author(s):  
Nindian Puspa Dewi ◽  
Ubaidi Ubaidi

<p class="Abstrak">Madurese is a regional language that, besides being used on Madura Island, is also used in other areas such as Jember, Pasuruan, and Probolinggo. As a regional language, Madurese is increasingly being abandoned, especially among young people. Contributing factors include a sense of prestige and the difficulty of learning Madurese, which has a variety of dialects and speech levels. The declining use of Madurese could lead to its extinction as one of Indonesia's regional languages. Efforts are therefore needed to maintain and preserve it. One such effort is research on Madurese in the field of <em>Natural Language Processing</em>, so that in the future Madurese can be studied through digital media. <em>Part Of Speech</em> (POS) <em>Tagging</em> is a foundation of <em>text processing</em> research, so a Madurese POS tagging application needs to be built for use in other <em>Natural Language Processing</em> research. In this study, POS tagging was implemented with the Brill Tagger algorithm using a <em>corpus</em> of 10,535 Madurese words. POS tagging with the Brill Tagger assigns the appropriate word class to each word using lexical and contextual rules. The Brill Tagger is the algorithm with the best accuracy when applied to English, Indonesian, and several other languages. In a series of experiments with varying <em>threshold</em> values, the average accuracy exceeded 80% when OOV <em>(Out Of Vocabulary)</em> words were disregarded, with a highest accuracy of 86.67%; when OOV words were taken into account, the average accuracy was 67.74%. 
It can thus be concluded that the Brill Tagger can be used for Madurese with a good level of accuracy.</p><p class="Abstrak"> </p><p class="Judul2"><strong><em>Abstract</em></strong></p><p class="Judul2"><em>Bahasa Madura is a regional language which is not only used on Madura Island but is also used in other areas such as several regions of Jember, Pasuruan, and Probolinggo. Today, Bahasa Madura is beginning to be abandoned, especially among young people. One reason is a sense of pride; it is also quite difficult to learn Bahasa Madura because it has a variety of dialects and language levels. The reduced use of Bahasa Madura can lead to its extinction as one of the regional languages in Indonesia. Therefore, there needs to be an effort to maintain the Madurese language. One of them is conducting research on Madurese in the field of Natural Language Processing so that, in the future, learning about Madurese can be done through digital media. 
Part of Speech (POS) Tagging is the basis of text processing research, so a Madurese POS tagging application needs to be made for use in other Natural Language Processing research. This study uses the Brill Tagger with a corpus containing 10,535 words. POS tagging with the Brill Tagger algorithm can assign the appropriate word class to each word using lexical and contextual rules. The Brill Tagger was chosen because it is the algorithm with the best accuracy when implemented in English, Indonesian, and several other languages. The experimental results with the Brill Tagger show that the average accuracy without OOV (Out Of Vocabulary) words is 86.6%, with a highest accuracy of 86.94%, and the average accuracy for OOV words reached 67.22%. So it can be concluded that the Brill Tagger algorithm can also be used for Bahasa Madura with a good degree of accuracy.</em></p>
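The with/without-OOV evaluation split reported above can be computed with a small helper that scores known and out-of-vocabulary tokens separately. The words and tags below are illustrative stand-ins, not taken from the paper's corpus:

```python
# Sketch: score a tagger separately on known and OOV tokens, mirroring
# the paper's with/without-OOV evaluation. All data here is illustrative.
def split_accuracy(gold, pred, lexicon):
    """gold: list of (word, gold_tag); pred: list of predicted tags;
    lexicon: set of words seen in training. Returns (known_acc, oov_acc)."""
    known = [(g, p) for (w, g), p in zip(gold, pred) if w in lexicon]
    oov   = [(g, p) for (w, g), p in zip(gold, pred) if w not in lexicon]
    acc = lambda pairs: sum(g == p for g, p in pairs) / len(pairs) if pairs else None
    return acc(known), acc(oov)

lexicon = {"oreng", "bini"}                        # words seen in training
gold = [("oreng", "NN"), ("bini", "NN"), ("ngakan", "VB"), ("cepet", "JJ")]
pred = ["NN", "NN", "VB", "NN"]                    # tagger output
print(split_accuracy(gold, pred, lexicon))         # → (1.0, 0.5)
```

The gap between the two numbers (86%+ known vs. roughly 67% OOV in the paper) is typical: a rule-based tagger has no lexical evidence for unseen words and must fall back on contextual rules alone.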


2021 ◽  
Vol 3 (2) ◽  
pp. 15
Author(s):  
Irkham Huda

Location search has become an everyday need, as shown by the large number of mapping service providers. To search for a location by reference to a particular spatial relation, users describe it in natural language, so a location search system that understands such input requires Natural Language Processing (NLP). Research on NLP for location search applications is still needed, particularly because no existing implementation supports Indonesian; existing related work supports only English, with limited coverage. In this study, an NLP system for location search, called NaLaMap, was developed. Open Street Map (OSM) is used as the location database, and a web application serves as the client for the case study. To transform a location-search sentence into a spatial query, the NLP system goes through five main stages: tokenization, POS tagging, NER tagging, entity normalization, and query construction. The resulting query is executed against the OSM-based location database, and the search results are displayed on a map in the client application. An end-to-end evaluation using 45 input sentences from respondents gave good results, with a precision of 0.97 and a recall of 0.91.
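A heavily simplified version of that five-stage pipeline can be sketched with toy rules. The gazetteer, the relation table, the 500 m default radius, and the Overpass-style query template are all invented for this example; the actual NaLaMap components are not described in enough detail here to reproduce:

```python
# Hypothetical sketch of the pipeline: tokenization -> POS/NER tagging ->
# entity normalization -> query construction, using invented toy rules.
GAZETTEER = {"restoran": "amenity=restaurant",        # POI word -> OSM tag
             "masjid": "amenity=place_of_worship"}
RELATIONS = {"dekat": "around:500"}                   # "near" -> assumed 500 m

def build_query(sentence):
    tokens = sentence.lower().split()                              # 1. tokenize
    kinds = ["REL" if t in RELATIONS else
             "POI" if t in GAZETTEER else "O" for t in tokens]     # 2-3. tag
    poi = next((t for t, k in zip(tokens, kinds) if k == "POI"), None)
    rel = next((t for t, k in zip(tokens, kinds) if k == "REL"), None)
    if poi is None:
        return None
    tag = GAZETTEER[poi]                                           # 4. normalize
    radius = RELATIONS.get(rel, "around:1000")                     # 5. build query
    # LAT,LON are placeholders for the reference coordinates.
    return f"node[{tag}]({radius},LAT,LON);"

print(build_query("cari restoran dekat sini"))
# → node[amenity=restaurant](around:500,LAT,LON);
```

A production system would use trained POS/NER models rather than dictionary lookups, but the data flow from tagged entities to a spatial query is the same.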




2021 ◽  
Vol 9 (2) ◽  
pp. 196-206
Author(s):  
Parmonangan R. Togatorop ◽  
Rezky Prayitno Simanjuntak ◽  
Siti Berliana Manurung ◽  
Mega Christy Silalahi

An Entity Relationship Diagram (ERD) can be modeled manually, but doing so generally takes a long time, so a generator that builds an ERD from a requirements specification would make ERD modeling easier. This study aims to develop a system that generates an ERD from an Indonesian-language requirements specification by applying several stages of Natural Language Processing (NLP) as required. The requirements specifications were obtained using a document-analysis technique. The NLP stages used are: case folding, sentence segmentation, tokenization, POS tagging, chunking, and parsing. The words in the processed text are then matched against rule-based patterns to find candidates for ERD components such as entities, attributes, primary keys, and relationships. The ERD is then drawn with Graphviz from the extracted components. The generated ERDs were evaluated using expert judgement. Across several case studies, the average precision, recall, and F1 scores per expert were: expert 1, 91%, 90%, 90%; expert 2, 90%, 90%, 90%; expert 3, 98%, 94%, 96%; expert 4, 93%, 93%, 93%; and expert 5, 98%, 83%, 90%.
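One rule from such a pipeline can be sketched as follows: after POS tagging, nouns around the verb "memiliki" ("has") become entity and attribute candidates. This is a toy stand-in for the paper's rule set, which is not given here; the example sentence ("mahasiswa memiliki nim dan nama", "a student has an ID number and a name") and tags are invented:

```python
# Illustrative rule-based pass over (word, POS) pairs: the noun before
# "memiliki" ("has") becomes an entity, the nouns after it its attributes.
def extract_erd(tagged):
    entities, attributes = [], []
    for i, (word, pos) in enumerate(tagged):
        if word == "memiliki":
            owner = tagged[i - 1][0] if i > 0 else None
            owned = [w for w, p in tagged[i + 1:] if p == "NN"]
            if owner:
                entities.append(owner)
                attributes.extend((owner, a) for a in owned)
    return entities, attributes

tagged = [("mahasiswa", "NN"), ("memiliki", "VB"),
          ("nim", "NN"), ("dan", "CC"), ("nama", "NN")]
print(extract_erd(tagged))
# → (['mahasiswa'], [('mahasiswa', 'nim'), ('mahasiswa', 'nama')])
```

The extracted entity and attribute pairs are exactly the shape of input Graphviz needs to lay out an ERD node with its attribute records.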


2021 ◽  
Vol 54 (3) ◽  
pp. 1-41
Author(s):  
Liping Zhao ◽  
Waad Alhoshan ◽  
Alessio Ferrari ◽  
Keletso J. Letsholo ◽  
Muideen A. Ajagbe ◽  
...  

Natural Language Processing for Requirements Engineering (NLP4RE) is an area of research and development that seeks to apply natural language processing (NLP) techniques, tools, and resources to the requirements engineering (RE) process, to support human analysts to carry out various linguistic analysis tasks on textual requirements documents, such as detecting language issues, identifying key domain concepts, and establishing requirements traceability links. This article reports on a mapping study that surveys the landscape of NLP4RE research to provide a holistic understanding of the field. Following the guidance of systematic review, the mapping study is directed by five research questions, cutting across five aspects of NLP4RE research, concerning the state of the literature, the state of empirical research, the research focus, the state of tool development, and the usage of NLP technologies. Our main results are as follows: (i) we identify a total of 404 primary studies relevant to NLP4RE, which were published over the past 36 years and from 170 different venues; (ii) most of these studies (67.08%) are solution proposals, assessed by a laboratory experiment or an example application, while only a small percentage (7%) are assessed in industrial settings; (iii) a large proportion of the studies (42.70%) focus on the requirements analysis phase, with quality defect detection as their central task and requirements specification as their commonly processed document type; (iv) 130 NLP4RE tools (i.e., RE-specific NLP tools) are extracted from these studies, but only 17 of them (13.08%) are available for download; (v) 231 different NLP technologies are also identified, comprising 140 NLP techniques, 66 NLP tools, and 25 NLP resources, but most of them—particularly the novel NLP techniques and specialized tools—are used infrequently; by contrast, commonly used NLP technologies are traditional analysis techniques (e.g., POS tagging and tokenization), general-purpose tools (e.g., Stanford CoreNLP and GATE) and generic language lexicons (WordNet and British National Corpus). The mapping study not only provides a collection of the literature in NLP4RE but also, more importantly, establishes a structure to frame the existing literature through categorization, synthesis and conceptualization of the main theoretical concepts and relationships that encompass both RE and NLP aspects. Our work thus produces a conceptual framework of NLP4RE. The framework is used to identify research gaps and directions, highlight technology transfer needs, and encourage more synergies between the RE community, the NLP community, and the software and systems practitioners. Our results can be used as a starting point to frame future studies according to a well-defined terminology and can be expanded as new technologies and novel solutions emerge.


Natural Language Processing (NLP), drawing on the power of artificial intelligence, has advanced the understanding of human language and enhanced the effectiveness of communication between humans and computers. The complexity and diversity of huge datasets have raised the need for automatic analysis of linguistic data using data-driven approaches. The performance of these data-driven approaches has improved with the use of deep learning techniques in various NLP application areas, such as automatic speech recognition and POS tagging. This paper addresses the challenges faced in NLP and the use of deep learning techniques in different application areas of NLP.


Author(s):  
Kenneth W. Church

There is a considerable literature on applications of statistical methods in natural-language processing. This chapter focuses on two types of applications: (1) recognition/transduction applications based on Shannon’s Noisy Channel such as speech recognition, optical character recognition (OCR), spelling correction, part-of-speech (POS) tagging, and machine translation (MT); and (2) discrimination/ranking applications such as sentiment analysis, information retrieval, spam email filtering, author identification, and word sense disambiguation (WSD). Shannon’s Noisy-Channel model is often used for the first type, and linear separators such as Naive Bayes and logistic regression are often used for the second type. These techniques have produced successful products that are being used by large numbers of people every day: web search, spelling correction, translation, etc. Despite successes such as these, it should be mentioned that all approximations have their limitations. At some point, perhaps in the not-too-distant future, the next generation may discover that the low-hanging fruit has been pretty well picked over, and it may be necessary to revisit some of these classic limitations.
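The noisy-channel recipe the chapter describes, choose the word w maximizing P(w)·P(t|w) for an observed string t, can be shown in a few lines. Both the word probabilities and the channel model below are toy values invented for the sketch:

```python
# Minimal noisy-channel sketch for spelling correction: pick the word w
# that maximizes P(w) * P(typo | w), with toy language and channel models.
P_WORD = {"the": 0.6, "then": 0.3, "than": 0.1}        # toy language model P(w)

def p_typo_given_word(typo, word):
    # Crude channel model: probability halves with each unit of a rough
    # edit-distance proxy (length difference + positional mismatches).
    dist = abs(len(typo) - len(word)) + sum(a != b for a, b in zip(typo, word))
    return 0.5 ** dist

def correct(typo):
    return max(P_WORD, key=lambda w: P_WORD[w] * p_typo_given_word(typo, w))

print(correct("thee"))   # → the
```

Real systems replace both toys with models estimated from data (an n-gram or neural language model, and edit probabilities learned from error corpora), but the argmax over prior times channel likelihood is exactly Shannon's decoding rule.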


2018 ◽  
Vol 7 (2.29) ◽  
pp. 742
Author(s):  
Rabab Ali Abumalloh ◽  
Hasan Muaidi Al-Serhan ◽  
Othman Bin Ibrahim ◽  
Waheeb Abu-Ulbeh

POS tagging has gained the interest of researchers in computational linguistics in recent years. Part-of-speech tagging systems automatically assign the proper grammatical tag, or morpho-syntactic category label, to every word in the corpus according to its appearance in the text. POS tagging serves as a fundamental, preliminary step in linguistic analysis and helps in developing many natural language processing applications, such as word processing systems, spell checkers, dictionary building, and parsing systems. The Arabic language has attracted researchers' interest, leading to increasing demand for Arabic natural language processing systems. Artificial neural networks have been applied in many applications, such as speech recognition and part-of-speech prediction, but they are a relatively new approach to part-of-speech tagging. In this research, we developed an Arabic POS tagger using an artificial neural network. A corpus of 20,620 words, manually assigned the appropriate tags, was developed and used to train the network and to test the tagger's overall performance. The accuracy of the developed tagger reaches 89.04% on the testing dataset and 98.94% on the training dataset; combining the two datasets, the accuracy for the whole system is 96.96%.
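The three reported accuracies are mutually consistent: the combined figure is a weighted average of the train and test figures, and solving for the weight recovers the implied train/test split. The arithmetic below uses only numbers stated in the abstract:

```python
# Consistency check on the reported figures: with overall accuracy 96.96%,
# training accuracy 98.94%, and test accuracy 89.04%, the combined score
# 96.96 = f * 98.94 + (1 - f) * 89.04 implies the training fraction f.
overall, train_acc, test_acc = 0.9696, 0.9894, 0.8904
train_frac = (overall - test_acc) / (train_acc - test_acc)
print(round(train_frac, 2))   # → 0.8, i.e. roughly an 80/20 train/test split
```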

