Perbandingan Metode Term Weighting terhadap Hasil Klasifikasi Teks pada Dataset Terjemahan Kitab Hadis

English title: Comparison of a term weighting method for the text classification in Indonesian hadith

Hadith is the second source of reference in Islam after the Qur'an. Hadith texts are now studied computationally so that the values they contain can be captured with technological methods. Retrieving information from the Books of Hadith requires representing text as vectors in order to optimize automatic classification. Hadith classification is needed to group the contents of the hadith into several categories. Some categories in one Book of Hadith are the same as those in other books, which shows that certain documents in one book share topics with documents in another. A term weighting method is therefore needed that can decide which words should receive high or low weights across the space of Hadith Books, in order to optimize classification results. This study compares several term weighting methods: Term Frequency Inverse Document Frequency (TF-IDF), Term Frequency Inverse Document Frequency Inverse Class Frequency (TF-IDF-ICF), Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency (TF-IDF-ICSδF), and Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency Inverse Hadith Space Density Frequency (TF-IDF-ICSδF-IHSδF). The resulting weights are evaluated on a dataset of translations of 9 Books of Hadith using Naive Bayes and SVM classifiers. The 9 books are Sahih Bukhari, Sahih Muslim, Abu Dawud, at-Turmudzi, an-Nasa'i, Ibnu Majah, Ahmad, Malik, and Darimi.

The experimental results show that classification with TF-IDF-ICSδF-IHSδF term weighting outperformed the other term weighting methods, achieving a Precision of 90%, Recall of 93%, F1-Score of 92%, and Accuracy of 83%.
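
The abstract names TF-IDF and TF-IDF-ICF but gives no formulas. The following is a minimal sketch, assuming one common formulation in which idf = 1 + log(D / df) and icf = 1 + log(C / cf), where cf is the number of classes containing the term. The function name `tf_idf_icf` and the toy corpus are hypothetical, and the ICSδF and IHSδF factors are not sketched because the abstract does not define them.

```python
import math
from collections import Counter

def tf_idf_icf(docs, labels):
    """Return per-document {term: weight} dicts for TF-IDF and TF-IDF-ICF.

    docs   : list of token lists
    labels : one class label per document
    Assumed formulation: idf = 1 + log(D / df), icf = 1 + log(C / cf).
    """
    n_docs = len(docs)
    classes = set(labels)
    # df[t]: number of documents that contain term t
    df = Counter(t for doc in docs for t in set(doc))
    # cf[t]: number of classes in which term t appears at least once
    class_terms = {c: set() for c in classes}
    for doc, c in zip(docs, labels):
        class_terms[c].update(doc)
    cf = Counter(t for terms in class_terms.values() for t in terms)

    tfidf, tfidf_icf = [], []
    for doc in docs:
        tf = Counter(doc)
        w1 = {t: tf[t] * (1 + math.log(n_docs / df[t])) for t in tf}
        w2 = {t: w1[t] * (1 + math.log(len(classes) / cf[t])) for t in tf}
        tfidf.append(w1)
        tfidf_icf.append(w2)
    return tfidf, tfidf_icf

# Toy corpus invented for illustration; it is not the hadith dataset.
docs = [["prayer", "time", "night"], ["prayer", "fasting"],
        ["trade", "time"], ["trade", "debt"]]
labels = ["worship", "worship", "commerce", "commerce"]
tfidf, tfidf_icf = tf_idf_icf(docs, labels)
```

In this toy corpus "time" occurs in both classes, so its ICF factor is 1 and its weight is unchanged, while "prayer" occurs in only one class and is boosted. This is the intuition behind class-aware weighting: terms concentrated in a single class become more discriminative.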

KLASIFIKASI LAGU DAERAH INDONESIA BERDASARKAN LIRIK MENGGUNAKAN METODE TF-IDF DAN NAÏVE BAYES

Jurnal Teknologi Informasi dan Terapan ◽

10.25047/jtit.v4i1.20 ◽

2019 ◽

Vol 4 (1) ◽

pp. 47-52 ◽

Cited By ~ 1

Author(s):

Pujo Hari Saputro ◽

Michael Aristian ◽

Dyah ListianingTyas

Keyword(s):

Naive Bayes ◽

Naïve Bayes ◽

Inverse Document Frequency ◽

Term Frequency ◽

Document Frequency

This study is an effort to classify Indonesian regional songs by their region of origin. The classification is expected to offer one way of identifying and mapping regional songs in Indonesia so that Indonesians can recognize their own culture. The data comprise 90 songs from various regions. The study tests term frequency-inverse document frequency (TF-IDF) as the feature extraction method and naïve Bayes as the classification method. It shows that TF-IDF extraction combined with naïve Bayes classification can be used to classify regional song lyrics by region, with an accuracy of 73.4% on the West Indonesia and East Indonesia sets.
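
To make the classification step concrete, here is a minimal sketch of a multinomial naïve Bayes classifier with add-one (Laplace) smoothing. It is an illustration under stated assumptions, not the authors' implementation: it uses raw term counts rather than TF-IDF weights to stay short, and the function names and toy token lists are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect the counts a multinomial naive Bayes model needs."""
    class_docs = Counter(labels)            # documents per class
    term_counts = defaultdict(Counter)      # term occurrences per class
    for doc, c in zip(docs, labels):
        term_counts[c].update(doc)
    vocab = {t for doc in docs for t in doc}
    return class_docs, term_counts, vocab

def predict_nb(model, doc):
    """Return the class with the highest log posterior for a token list."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for c in sorted(class_docs):
        total_terms = sum(term_counts[c].values())
        score = math.log(class_docs[c] / total_docs)   # log prior
        for t in doc:
            if t in vocab:  # terms unseen in training are ignored
                # add-one smoothed log likelihood of term t in class c
                score += math.log((term_counts[c][t] + 1)
                                  / (total_terms + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy data invented for illustration; not actual song lyrics.
docs = [["river", "boat"], ["boat", "fish"],
        ["mountain", "bird"], ["bird", "forest", "mountain"]]
labels = ["west", "west", "east", "east"]
model = train_nb(docs, labels)
prediction = predict_nb(model, ["boat", "river"])
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and smoothing keeps a single unseen term from zeroing out a class.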

Recommendation System Using Weighted TF-IDF and Naive Bayes Classifiers on RSS Contents

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2010.p0631 ◽

2010 ◽

Vol 14 (6) ◽

pp. 631-637

Author(s):

Incheon Paik ◽

Hiroshi Mizugai ◽

Keyword(s):

Machine Learning ◽

Recommendation System ◽

Naive Bayes ◽

Naïve Bayes ◽

Bayes Classifier ◽

Inverse Document Frequency ◽

Term Frequency ◽

Document Frequency ◽

Rss Feeds ◽

Enormous Quantity

A recent increase in RDF Site Summary (RSS) feeds, used for news updates and blogs, has been caused by the widespread use of blogs. Because of this enormous quantity of material, much effort is now needed to search the contents of RSS feeds. To solve this problem, recommendation systems enable users to obtain relevant RSS contents easily and quickly. In previous research, an RSS recommendation system was proposed that used the similarity between the Term Frequency (TF) of the RSS contents and the TF derived from the contents of the user's browsing history for RSS feeds. In this paper, we use Term Frequency-Inverse Document Frequency (TF-IDF) calculations to propose a Weighted TF-IDF method, which focuses on the terms enclosed by the title tags in RSS contents as characteristic terms. In addition, we propose a new recommendation method, which uses a Naive Bayes classifier in a Machine Learning-based approach. Via experiments, we compare the proposed methods and the existing method in a prototype recommendation system, and we show that the proposed methods outperform the existing method with respect to several evaluation measurements.

Sistem Analisis Sentimen pada Ulasan Produk Menggunakan Metode Naive Bayes

Jurnal Edukasi dan Penelitian Informatika (JEPIN) ◽

10.26418/jp.v4i2.27526 ◽

2018 ◽

Vol 4 (2) ◽

pp. 113

Author(s):

Billy Gunawan ◽

Helen Sasty Pratiwi ◽

Enda Esyudha Pratama

Keyword(s):

Naive Bayes ◽

Confusion Matrix ◽

Naïve Bayes ◽

Frequency Data ◽

Online Data ◽

Inverse Document Frequency ◽

Term Frequency ◽

Document Frequency ◽

Bahasa Indonesia

The sentiment analysis system automatically analyzes online product reviews written in Indonesian to extract the sentiment information they contain. The data are classified using Naive Bayes. The system has 5 (five) stages: crawling, pre-processing, term weighting, model building, and sentiment classification. Term weighting uses TF-IDF (Term Frequency-Inverse Document Frequency). The data are classified into 5 (five) classes: very negative, negative, neutral, positive, and very positive. The results are then evaluated with a confusion matrix using accuracy, recall, and precision. In the 3-class test (negative, neutral, and positive), the best result was obtained with 90% training data and 10% test data, with an accuracy of 77.78%, recall of 93.33%, and precision of 77.78%. In the 5-class test, the best result was likewise obtained with 90% training data and 10% test data, with an accuracy of 59.33%, recall of 58.33%, and precision of 59.33%. The class predictions for the test data are relevant when the sentiment classes labeled by the supervisor are compared with those produced by the system, although they are not yet fully accurate.
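
The confusion-matrix evaluation mentioned above can be sketched as follows. This is a minimal illustration assuming single-label predictions. The function names and the six toy labels are hypothetical and do not reproduce the paper's data or its averaging scheme.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(y_true, y_pred))

def per_class_metrics(cm, labels):
    """Return accuracy and {label: (precision, recall)}."""
    total = sum(cm.values())
    accuracy = sum(cm[(c, c)] for c in labels) / total
    metrics = {}
    for c in labels:
        tp = cm[(c, c)]
        predicted = sum(cm[(a, c)] for a in labels)  # column sum
        actual = sum(cm[(c, p)] for p in labels)     # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        metrics[c] = (precision, recall)
    return accuracy, metrics

# Toy labels invented for illustration.
labels = ["negative", "neutral", "positive"]
y_true = ["negative", "negative", "neutral", "positive", "positive", "positive"]
y_pred = ["negative", "neutral", "neutral", "positive", "positive", "negative"]
cm = confusion_matrix(y_true, y_pred)
accuracy, metrics = per_class_metrics(cm, labels)
```

Precision is computed over the predictions for a class (a column of the matrix) and recall over its actual members (a row), so the two can differ considerably for the same class.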

Aspect Category Classification dengan Pendekatan Machine Learning Menggunakan Dataset Bahasa Indonesia

Jurnal Nasional Teknik Elektro dan Teknologi Informasi (JNTETI) ◽

10.22146/jnteti.v10i3.1819 ◽

2021 ◽

Vol 10 (3) ◽

pp. 229-235

Author(s):

Syaifulloh Amien Pandega Perdana ◽

Teguh Bharata Aji ◽

Ridi Ferdiana

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Sentiment Analysis ◽

Support Vector ◽

Term Weighting ◽

Inverse Document Frequency ◽

Term Frequency ◽

Document Frequency ◽

Bahasa Indonesia

Customer reviews are opinions about the quality of goods or services as experienced by consumers. They contain information useful to both consumers and providers of goods or services. The large volume of customer reviews available on websites calls for a framework that extracts sentiment automatically. A single review often covers many aspects, so Aspect Based Sentiment Analysis (ABSA) is needed to determine the polarity of each aspect. One important task in ABSA is Aspect Category Detection. Machine learning methods for Aspect Category Detection have been widely applied in English-language domains but only rarely in Indonesian. This paper compares the performance of three machine learning algorithms, Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF), on Indonesian-language customer reviews using Term Frequency-Inverse Document Frequency (TF-IDF) for term weighting. The results show that RF outperforms NB and SVM in three different domains (restaurant, hotel, and e-commerce), with f1-scores of 84.3%, 85.7%, and 89.3%, respectively.

Hierarchical text classification using Relative Inverse Document Frequency

ECTI Transactions on Computer and Information Technology (ECTI-CIT) ◽

10.37936/ecti-cit.2021152.240515 ◽

2021 ◽

Vol 15 (2) ◽

pp. 166-176

Author(s):

Boonthida Chiraratanasopha ◽

Thanaruk Theeramunkong ◽

Salin Boonbrahm

Keyword(s):

Text Classification ◽

Term Weighting ◽

Hierarchical Tree ◽

Inverse Document Frequency ◽

Document Frequency ◽

Relative Inverse ◽

The Hierarchical Structure ◽

Family Based ◽

Hierarchical Text Classification

Automatic hierarchical text classification is a challenging and much-needed task as hierarchical taxonomies multiply with the growth of knowledge organization. The hierarchical structure identifies dependency relationships between categories, in which general and specific concepts within the tree can overlap. This paper uses the frequency of a term's occurrence in related categories of the hierarchical tree to aid document classification. Four extended term weightings of Relative Inverse Document Frequency (IDFr), based on a term's own category, its parent category, its sibling categories, and its child categories, are used to build a classifier model with a centroid-based technique. In experiments on hierarchical text classification of Thai documents, IDFr achieved the best accuracy and F-measure, 53.65% and 50.80%, with the Top-n feature set under family-based evaluation; these are 2.35% and 1.15% higher, respectively, than TF-IDF in the same settings.

Subjectivity Classification of Filipino Text with Features Based on Term Frequency -- Inverse Document Frequency

2013 International Conference on Asian Language Processing ◽

10.1109/ialp.2013.40 ◽

2013 ◽

Cited By ~ 2

Author(s):

Ralph Vincent J. Regalado ◽

Jenina L. Chua ◽

Justin L. Co ◽

Thomas James Z. Tiam-Lee

Keyword(s):

Inverse Document Frequency ◽

Term Frequency ◽

Document Frequency

Komparasi Term Weighting dan Word Embedding pada Klasifikasi Tweet Pemerintah Daerah

Jurnal Nasional Teknik Elektro dan Teknologi Informasi (JNTETI) ◽

10.22146/jnteti.v9i2.90 ◽

2020 ◽

Vol 9 (2) ◽

pp. 155-161

Author(s):

Pande Made Risky Cahya Dinatha ◽

Nur Aini Rakhmawati

Keyword(s):

Logistic Regression ◽

Word Embedding ◽

Term Weighting ◽

Inverse Document Frequency ◽

Term Frequency ◽

Average Recall ◽

Document Frequency

The rise of social media has encouraged governments to use it as a channel for disseminating information. The information provided should benefit the public in order to improve the government-to-citizen relationship. Classifying local government social media posts can reveal the types of information being posted. Research on classifying social media posts in the case of Indonesian local governments has been carried out successfully, but the text processing used to build the classification model can still be explored. The text processing methods discussed in this paper are term weighting and word embedding. The aim is to compare the term weighting methods term frequency-inverse document frequency and Okapi BM25 with the word embedding doc2vec in producing features for short-text classification. The paper represents text as features for classification, measures the performance of the classification models that apply these techniques, and compares the models to identify the best method for the case study of classifying local government social media posts in Indonesia. There are six classes for classifying 1,000 short texts from 91 local government accounts. Precision, recall, F1, macro-average, micro-average, and AUC are measured for each model. The results show that TF-IDF with a linear SVM gives better results than logistic regression, scoring 0.572 and 0.766 on macro-average recall and micro-average recall.
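
The gap between the macro-average and micro-average recall reported above is easier to read with a small worked example. The sketch below assumes single-label classification and uses an invented, deliberately imbalanced toy set. The function name is hypothetical, and the example does not claim to explain the paper's actual scores.

```python
from collections import Counter

def macro_micro_recall(y_true, y_pred):
    """Return (macro-average recall, micro-average recall)."""
    actual = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Macro: compute recall per class, then take the unweighted mean.
    per_class = {c: hits[c] / actual[c] for c in actual}
    macro = sum(per_class.values()) / len(per_class)
    # Micro: pool all samples, then compute recall once.
    micro = sum(hits.values()) / sum(actual.values())
    return macro, micro

# Imbalanced toy labels invented for illustration: 8 of class "a", 2 of "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
macro, micro = macro_micro_recall(y_true, y_pred)
```

Here the rare class "b" has a recall of 0.5, which pulls the macro average down to 0.75 while the micro average stays at 0.9. A macro score well below the micro score can therefore indicate weaker performance on smaller classes.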

Implementation of Rumor Detection on Twitter Using J48 Algorithm

Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) ◽

10.29207/resti.v4i5.2059 ◽

2020 ◽

Vol 4 (5) ◽

pp. 775-781

Author(s):

Yoan Maria Vianny ◽

Erwin Budi Setiawan

Keyword(s):

Information Sources ◽

Detection System ◽

Training Data ◽

Weighting Method ◽

Inverse Document Frequency ◽

Term Frequency ◽

Document Frequency ◽

Rumor Detection

The existence of rumors on Twitter has caused a lot of unrest among Indonesians. Because the validity of the information is unknown, users are confused by it. In this study, an Indonesian rumor detection system is built using the J48 algorithm together with the Term Frequency Inverse Document Frequency (TF-IDF) weighting method. The dataset contains 47,449 manually labeled tweets. This study offers new features, namely the number of emoticons in the display name, the number of digits in the display name, and the number of digits in the username. These three new features are used to maximize information about information sources. The highest accuracy, 75.76%, is obtained using 90% training data and 1,000 TF-IDF features in 1-gram to 3-gram combinations.
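
The "1-gram to 3-gram combinations" mentioned above refer to word sequences of length one to three used as features. A minimal sketch of extracting them, assuming whitespace tokenization, is shown below. The function name and sample sentence are hypothetical, and the abstract does not say how the 1,000 features were selected, so that step is not shown.

```python
def word_ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the sequence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Sample sentence invented for illustration.
tokens = "this news is false".split()
grams = word_ngrams(tokens)
```

For the four-token sample this yields 4 unigrams, 3 bigrams, and 2 trigrams. Each distinct n-gram would then be treated as a term when TF-IDF weights are computed.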

Analisis Sentimen Kebijakan Kampus Merdeka Menggunakan Naive Bayes dan Pembobotan TF-IDF Berdasarkan Komentar pada Youtube

Jurnal Sistem Informasi, Teknologi Informasi, dan Edukasi Sistem Informasi ◽

10.25126/justsi.v2i1.24 ◽

2021 ◽

Vol 2 (1) ◽

Author(s):

Dhaifa Farah Zhafira ◽

Bayu Rahayudi ◽

Indriati Indriati

Keyword(s):

Cross Validation ◽

Naive Bayes ◽

Naïve Bayes ◽

Naive Bayes Classifier ◽

Bayes Classifier ◽

Naïve Bayes Classifier ◽

Inverse Document Frequency ◽

Document Frequency ◽

Text Preprocessing ◽

Fold Cross Validation

Kampus Merdeka is a new policy initiated by the Minister of Education and Culture of the Republic of Indonesia (Mendikbud RI). The policy has drawn wide public attention, particularly on YouTube in connection with videos uploaded to the Minister's channel. On YouTube, public opinion can flood the comment section in an instant because of the platform's emergence as the first to offer audio-visual content facilities. This study analyzes public opinion in YouTube comments by classifying it into positive and negative sentiment. Classification is implemented in Google Colaboratory, which is based on Python and Jupyter Notebook, using the Naive Bayes Classifier algorithm with Term Frequency Inverse Document Frequency (TF-IDF) term weighting. The study has 5 main processes: manual labeling, text preprocessing, TF-IDF weighting, data validation using k-fold cross validation, and classification. The best accuracy, 97%, was obtained using 900 training data, 100 test data, TF-IDF weighting, and 10-fold cross validation. The average accuracy over the 10 iterations of k-fold cross validation was 91.8%, with precision, recall, and f-measure of 90.35%, 93.6%, and 91.95%. Based on these results, the Naive Bayes Classifier is a reasonably good alternative for sentiment analysis.
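
The 900/100 split under 10-fold cross validation can be illustrated with a short sketch. It assumes 1,000 samples in total, as implied by the split reported above, and uses contiguous, unshuffled folds for simplicity. The function name is hypothetical and this is not the authors' Colaboratory code.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(1000, k=10))
```

Each of the 10 iterations holds out a different block of 100 samples for testing and trains on the remaining 900. Averaging the per-fold accuracies gives a figure such as the reported 91.8%, while the best single fold gives the maximum. In practice the data would usually be shuffled before splitting.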

Perbandingan Optimasi Feature Selection pada Naïve Bayes untuk Klasifikasi Kepuasan Airline Passenger

Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) ◽

10.29207/resti.v5i3.3086 ◽

2021 ◽

Vol 5 (3) ◽

pp. 527-533

Author(s):

Yoga Religia ◽

Amali Amali

Keyword(s):

Feature Selection ◽

Customer Satisfaction ◽

Naive Bayes ◽

Naïve Bayes ◽

Point Of View ◽

Classification Model ◽

Passenger Satisfaction ◽

Airline Passenger ◽

Bayes Algorithm

The quality of an airline's services cannot be measured from the company's point of view, but must be seen from the point of view of customer satisfaction. Data mining techniques make it possible to predict airline customer satisfaction with a classification model. The Naïve Bayes algorithm has demonstrated outstanding classification accuracy, but its independence assumption is currently rarely discussed. Some literature suggests attribute weighting to relax the independence assumption, which can be done using particle swarm optimization (PSO) and genetic algorithm (GA) through feature selection. This study compares PSO and GA optimization on Naïve Bayes for classifying the Airline Passenger Satisfaction data taken from www.kaggle.com. After testing, the best performance is obtained by the Naïve Bayes algorithm with PSO optimization, with an accuracy of 86.13%, precision of 87.90%, recall of 87.29%, and AUC of 0.923.
