PENERAPAN METODE TERM FREQUENCY INVERSE DOCUMENT FREQUENCY (TF-IDF) DAN COSINE SIMILARITY PADA SISTEM TEMU KEMBALI INFORMASI UNTUK MENGETAHUI SYARAH HADITS BERBASIS WEB (STUDI KASUS: HADITS SHAHIH BUKHARI-MUSLIM)

Hadith is a source of Islamic teaching alongside the Qur'an. Without hadith, Islamic law (shari'a) cannot be fully understood or practiced. Today, however, many people misunderstand hadith because they rely only on the literal text. One way to learn the meaning of a hadith, and to minimize misinterpretation, is to study its syarah (commentary). Existing syarah applications are limited: most are entirely in Arabic, which not everyone can read, and the only Indonesian-language options, Lidwa and Arbain, are still very broad in scope. This work therefore proposes a web-based information retrieval system that uses a similarity method to find the documents relevant to a query. The similarity method is cosine similarity with TF-IDF term weighting. Text preprocessing (tokenizing, stopword removal or filtering, and stemming) is applied first to reduce the number of terms and so speed up the term calculations. Testing with a confusion matrix gave a recall of 88.7%, precision of 100%, accuracy of 88.73%, and an error rate of 11.27%.
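The retrieval pipeline this abstract describes (preprocess, weight terms with TF-IDF, rank by cosine similarity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stopword list, the toy documents, the base-10 logarithm in the IDF, and all function names are assumptions, and stemming is left out.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "of", "a", "is", "and", "in", "to"}

def preprocess(text):
    """Lowercase, tokenize on letters, and drop stopwords (stemming omitted)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def weigh(tokens, idf):
    """TF-IDF weights for one token list; terms unknown to the corpus are skipped."""
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}

def build_index(docs):
    """Return (idf table, list of TF-IDF vectors) for a list of raw documents."""
    tokenized = [preprocess(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log10(n / count) for term, count in df.items()}
    return idf, [weigh(toks, idf) for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query, docs):
    """Rank documents by similarity to the query; returns (score, doc index) pairs."""
    idf, vectors = build_index(docs)
    q = weigh(preprocess(query), idf)
    return sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)

# Toy documents standing in for commentary entries (not taken from the paper).
docs = [
    "intention determines the reward of every deed",
    "cleanliness is part of faith",
    "seeking knowledge is an obligation",
]
ranked = search("reward of deed", docs)
print(ranked[0])  # best match as (score, document index)
```

Only the first toy document shares terms with the query, so it ranks first; the other two score zero.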

Product Codefication Accuracy With Cosine Similarity And Weighted Term Frequency And Inverse Document Frequency (TF-IDF)

Journal of Applied Engineering and Technological Science (JAETS) ◽

10.37385/jaets.v2i2.210 ◽

2021 ◽

Vol 2 (2) ◽

pp. 62-69

Author(s):

Sintia Sintia ◽

Sarjon Defit ◽

Gunadi Widi Nurcahyo

Keyword(s):

Information System ◽

Cosine Similarity ◽

Search Process ◽

Inverse Document Frequency ◽

Term Frequency ◽

Product Code ◽

Document Frequency ◽

Similarity Method

In the SiPaGa application, the codification search is still inaccurate, so OPD often choose the wrong goods codes. Cosine similarity and TF-IDF are therefore needed to improve search accuracy. Cosine similarity calculates similarity using keywords from the goods code. Term Frequency-Inverse Document Frequency (TF-IDF) assigns a weight to the relationship between a word (term) and a document. The purpose of this research is to improve the accuracy of the goods codification search. The study processed 14,417 goods codification records from the database of the Goods and Price Planning Information System (SiPaGa) application. Search keywords were processed with cosine similarity to measure similarity and with TF-IDF to calculate weights. The research produces the cosine similarity and TF-IDF calculations, which are expected to be applied to SiPaGa so that its search is more accurate than before and OPD can choose the intended product code.

The Analysis of Proximity Between Subjects Based on Primary Contents Using Cosine Similarity on Lective

Kinetik Game Technology Information System Computer Network Computing Electronics and Control ◽

10.22219/kinetik.v2i4.271 ◽

2017 ◽

pp. 299-308

Author(s):

Muhammad Andi Al-rizki ◽

Galih Wasis Wicaksono ◽

Yufis Azhar

Keyword(s):

Cosine Similarity ◽

High Similarity ◽

Inverse Document Frequency ◽

Term Frequency ◽

Research Results ◽

Document Frequency ◽

Precision And Accuracy ◽

The Relationship ◽

Similarity Method

In education, recognizing the relationship between one subject and another is imperative. Knowing how courses relate makes it easy to map continuity between subjects and to detect and reduce duplicated content across them. This benefits lecturers, students, and departments. Lecturers can more easily analyze and discuss subjects in the same domain, students can more conveniently choose a group of subjects they are interested in, and departments can create specialization groups based on subject similarity and combine courses that are highly similar. In this research, given a good database, the relationship between subjects was calculated from the proximity of their primary contents. The features were terms, each valued by its TF-IDF (Term Frequency-Inverse Document Frequency) weight. Proximity between subjects was measured with cosine similarity. Finally, the system was tested using precision, recall, and accuracy. The results show precision and accuracy of 90.91% and recall of 100%.
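The three evaluation measures named above all follow from the four cells of a confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical values chosen for illustration and are not the study's data.

```python
def evaluate(tp, fp, fn, tn):
    """Precision, recall, and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Of everything predicted positive, how much was truly positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of everything truly positive, how much was found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of all decisions that were correct.
        "accuracy": (tp + tn) / total if total else 0.0,
    }

# Hypothetical counts: 8 true positives, 2 false positives,
# 0 false negatives, 10 true negatives.
metrics = evaluate(tp=8, fp=2, fn=0, tn=10)
print(metrics)
```

With no false negatives the recall is 1.0 even though precision is lower, which is the same pattern (perfect recall, imperfect precision) that the abstract reports.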

Analysis of Matric Product Matching Between Cosine Similarity with Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec in PT. Pricebook Digital Indonesia

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit195672 ◽

2020 ◽

pp. 105-112

Author(s):

Harni Kusniyati ◽

Arie Aditya Nugraha

Keyword(s):

Word Embedding ◽

Cosine Similarity ◽

Consumer Experience ◽

Inverse Document Frequency ◽

Term Frequency ◽

Document Frequency ◽

The Relationship ◽

Product Specifications ◽

Similarity Method ◽

Product Matching

Consumers today can buy products from thousands of e-commerce sites. However, the completeness of product specifications and the taxonomies used to organize products differ from one online shop to another. To improve the consumer experience, Pricebook integrates products through its website to find the cheapest price across platforms. In this paper, we use text representation models, namely TF-IDF (term frequency-inverse document frequency) and Word2vec, together with cosine similarity. TF-IDF assigns a weight to the relationship between a word (term) and a document. A semantic vector, or word embedding, represents the structure of a sentence by converting it into vector form with Word2Vec. Cosine similarity calculates the similarity between two objects expressed as vectors, using the keywords of a document as the measure, which leads to better product matching and categorization performance. In addition, we compare the TF-IDF and Word2vec representations on a number of data.

Regresi Linear untuk Mengurangi Bias Sistem Penilaian Uraian Singkat

Jurnal Nasional Teknik Elektro dan Teknologi Informasi (JNTETI) ◽

10.22146/jnteti.v10i3.1983 ◽

2021 ◽

Vol 10 (3) ◽

pp. 221-228

Author(s):

Silmi Fauziati ◽

Adhistya Erna Permanasari ◽

Indriana Hidayah ◽

Eko Wahyu Nugroho ◽

Bobby Rian Dewangga

Keyword(s):

Cosine Similarity ◽

Inverse Document Frequency ◽

Term Frequency ◽

Document Frequency ◽

E Learning

This paper aims to improve the performance of a short-answer test grading system. The improvement adds simple linear regression to the combined output of the cosine similarity method (with word-frequency weighting based on Term Frequency-Inverse Document Frequency (TF-IDF)) and a word-matching mechanism. The regression uses the short-answer score (the result of cosine similarity and word matching) as the regressor variable. To assess the effectiveness of the proposed system, its performance is measured relative to manual grades given by lecturers. Before linear regression, the system tended to give higher scores than the lecturers' manual grades, that is, the scores were biased. Linear regression improves the system by significantly reducing this bias, so that the scores it gives tend to be neither higher nor lower than the manual grades. The finding that grading bias can be reduced significantly with a method as simple as linear regression is expected to help accelerate the adoption of automatic grading for essay tests in online learning technologies such as e-learning.
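The calibration idea can be illustrated with ordinary least squares on a single regressor. This is only a sketch under assumptions: the scores below are invented, the function names are hypothetical, and treating the lecturer's grade as the response variable is an assumed reading of the abstract rather than a detail it states.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def mean_bias(predicted, reference):
    """Average signed difference; positive means the system over-scores."""
    return sum(p - r for p, r in zip(predicted, reference)) / len(reference)

# Hypothetical scores: automatic system output vs. lecturer's manual grade.
system_scores = [60, 70, 80, 90]
manual_scores = [50, 62, 70, 82]

intercept, slope = fit_simple_linear_regression(system_scores, manual_scores)
calibrated = [intercept + slope * s for s in system_scores]

print(mean_bias(system_scores, manual_scores))  # bias before calibration
print(mean_bias(calibrated, manual_scores))     # bias after calibration
```

On the data used for fitting, least squares residuals sum to zero, so the mean bias after calibration vanishes by construction. On held-out answers the bias would generally be small but not exactly zero.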

Sentiment Analysis On Customer Reviews Using Support Vector Machine and Usability Scoring Using System Usability Scale

Journal of Information Technology and Computer Science ◽

10.25126/jitecs.202163330 ◽

2021 ◽

Vol 6 (3) ◽

pp. 236-251

Author(s):

Novira Azpiranda ◽

Ahmad Afif Supianto ◽

Nanang Yudi Setiawan ◽

Endang Suryawati ◽

R. Sandra Yuwana ◽

...

Keyword(s):

Support Vector Machine ◽

Business Strategy ◽

Confusion Matrix ◽

Support Vector ◽

Inverse Document Frequency ◽

Customer Reviews ◽

A Value ◽

Document Frequency ◽

System Usability Scale ◽

System Usability

Al-Ghiff Steak is a restaurant in Cirebon City that offers quality steaks at affordable prices. To maintain its competitive advantage and reputation, Al-Ghiff Steak needs to build good relationships with customers and adopt a business strategy that takes customer opinions into account. In practice, however, it has difficulty collecting and processing customer review data manually. Sentiment analysis of Google Reviews is therefore needed to determine how customers view Al-Ghiff Steak's products and services. The analysis covered 968 Google Reviews from 2016 to 2020 and used the Support Vector Machine (SVM) and Term Frequency-Inverse Document Frequency (TF-IDF) methods. Classification was tested with a confusion matrix on four measures: accuracy, precision, recall, and F1-score. SVM with TF-IDF achieved an accuracy of 83%, precision of 64%, recall of 60%, and F1-score of 59%. The sentiment classification results were then visualized in a dashboard. The System Usability Scale (SUS) was used for usability testing and produced a score of 77.5, which falls in the Acceptable category with an Excellent rating.
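For readers unfamiliar with how a SUS value is obtained, the standard scoring rule for one respondent's ten answers can be sketched as below. The responses are hypothetical, and the assumption that a study-level figure such as the one reported here is a mean over respondents is ours, not something the abstract states.

```python
def sus_score(responses):
    """System Usability Scale score for one respondent.

    responses: ten answers on a 1-5 scale, in questionnaire order.
    Odd-numbered items contribute (answer - 1), even-numbered items
    contribute (5 - answer); the sum is scaled by 2.5 to a 0-100 range.
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    total = 0
    for item, answer in enumerate(responses, start=1):
        if not 1 <= answer <= 5:
            raise ValueError("answers must be between 1 and 5")
        total += (answer - 1) if item % 2 == 1 else (5 - answer)
    return total * 2.5

# Hypothetical respondent: agrees with positive items, disagrees with negative ones.
example = sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2])
print(example)
```

Each item contributes between 0 and 4 points, so the best possible answers give 40 points and a score of 100.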

Classification of Sentiment Based on Movie Feedback Given By Audiences

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d7237.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 2594-2602

Keyword(s):

Confusion Matrix ◽

Real Life ◽

Class Imbalance ◽

Extraction Methods ◽

Class Imbalance Problem ◽

Inverse Document Frequency ◽

Imbalance Problem ◽

Life Problems ◽

Document Frequency

Automated sentiment analysis of audience feedback is much needed. Manually reading all the feedback on a movie is tedious, so this study uses machine learning models to predict the polarity of a movie from its reviews. The IMDB movie reviews dataset is used for training and testing. The study also illustrates the real-life problems of class imbalance and train-test splits, and obtains solutions for them. Class imbalance affects many predictive applications, such as cancer detection and the detection of fraudulent bank transactions, so this study attempts a solution to it. Undersampling is used to improve accuracy on the imbalanced classes. Bag of Words and Term Frequency-Inverse Document Frequency are used to generate features from the reviews. Logistic regression and SVM classifiers are used to measure accuracy. The confusion matrix is also calculated to show the effect of class imbalance on accuracy.
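One common form of undersampling is to randomly discard examples from the larger classes until every class matches the size of the smallest one. The abstract does not say which variant was used, so the random strategy, the function name, and the toy reviews below are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def undersample(samples, labels, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = min(len(items) for items in by_class.values())
    balanced_samples, balanced_labels = [], []
    for label, items in by_class.items():
        for sample in rng.sample(items, target):
            balanced_samples.append(sample)
            balanced_labels.append(label)
    return balanced_samples, balanced_labels

# Toy imbalanced data: six positive reviews, two negative reviews.
reviews = ["p1", "p2", "p3", "p4", "p5", "p6", "n1", "n2"]
polarity = ["pos"] * 6 + ["neg"] * 2

balanced_reviews, balanced_polarity = undersample(reviews, polarity)
print(Counter(balanced_polarity))
```

The trade-off is that majority-class examples are thrown away, so this is usually applied to the training split only, leaving the test split at its natural class ratio.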

Analysis of Unsatisfying User Experiences and Unmet Psychological Needs for Virtual Reality Exergames Using Deep Learning Approach

Information ◽

10.3390/info12110486 ◽

2021 ◽

Vol 12 (11) ◽

pp. 486

Author(s):

Xiaoyan Zhang ◽

Qiang Yan ◽

Simin Zhou ◽

Linye Ma ◽

Siran Wang

Keyword(s):

Virtual Reality ◽

Deep Learning ◽

Experimental Studies ◽

Online Reviews ◽

Psychological Needs ◽

Gradient Boosting ◽

Inverse Document Frequency ◽

Document Frequency ◽

Extreme Gradient Boosting ◽

Speed Up

The number of consumers playing virtual reality games is booming. To speed up product iteration, the user experience team needs to collect and analyze unsatisfying experiences promptly. In this paper, we aim to detect the unsatisfying experiences hidden in online reviews of virtual reality exergames using a deep learning method, and to identify users' unmet psychological needs based on self-determination theory. Convolutional neural networks for sentence classification (textCNN) are used to classify online reviews with unsatisfying experiences. For comparison, eXtreme gradient boosting (XGBoost) with lexical features serves as the machine learning baseline. Term frequency-inverse document frequency (TF-IDF) is used to extract keywords from every set of classified reviews. The micro-F1 score of the textCNN classifier is 90.00, better than the 82.69 of XGBoost. The top 10 keywords of every set of reviews reflect relevant topics of unmet psychological needs. This paper explores the potential problems causing unsatisfying experiences and unmet psychological needs in virtual reality exergames through text mining, and supplements experimental studies of virtual reality exergames.
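Keyword extraction with TF-IDF can be sketched by treating each set of classified reviews as one document and keeping the highest-weighted terms per set. How the authors formed documents and which TF-IDF variant they used is not stated, so the grouping, the length-normalized term frequency, the natural logarithm, and the toy tokens here are assumptions.

```python
import math
from collections import Counter

def top_keywords(groups, k=10):
    """Return the k highest TF-IDF terms for each group.

    groups: dict mapping a group name to its list of tokens
            (all reviews of that group concatenated).
    """
    n = len(groups)
    # Document frequency: in how many groups each term occurs.
    df = Counter(term for tokens in groups.values() for term in set(tokens))
    result = {}
    for name, tokens in groups.items():
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[name] = ranked[:k]
    return result

# Toy token lists for two hypothetical review sets.
groups = {
    "set_a": ["headset", "heavy", "headset", "game"],
    "set_b": ["level", "hard", "game", "level"],
}
keywords = top_keywords(groups, k=2)
print(keywords)
```

A term such as "game" that appears in every set gets an IDF of zero and drops to the bottom of each ranking, which is why TF-IDF favors words that distinguish one set from the others.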

IMPLEMENTASI TOKENIZING PLUS PADA SISTEM PENDETEKSI KEMIRIPAN JURNAL SKRIPSI

Jurnal Informatika Polinema ◽

10.33795/jip.v2i1.50 ◽

2017 ◽

Vol 2 (1) ◽

pp. 24

Author(s):

Paratisa Kharismadita ◽

Faisal Rahutomo

Keyword(s):

Cosine Similarity ◽

Inverse Document Frequency ◽

Term Frequency ◽

Document Frequency

One graduation requirement for undergraduate, master's, and doctoral students is publishing a scientific paper. To graduate, an undergraduate must produce a paper published in a scientific journal. However, plagiarism of journal papers is widespread in Indonesia, not only among undergraduate students but also in some cases in master's and doctoral programs at several educational institutions. A system that detects similarity between journal papers is therefore needed to reduce plagiarism in education. One stage of the system is Tokenizing Plus (building a word library based on the KBBI). Tokenizing Plus is the process of obtaining the base words and compound words found in the KBBI. Term Frequency-Inverse Document Frequency (TF-IDF) and cosine similarity are used to obtain the similarity value. The system compares the entire content of each paper: abstract, title, and body.
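The abstract does not describe how Tokenizing Plus works internally. One plausible way to keep dictionary compound words together as single tokens is greedy longest-match lookup, sketched below. The algorithm, the function name, and the two-entry compound list (standing in for a KBBI-derived word library) are all assumptions, and reduction to base words is not shown.

```python
def tokenize_plus(text, compound_words, max_len=3):
    """Split text into tokens, keeping known compound words as one token.

    compound_words: set of multi-word entries, hypothetically drawn
                    from a dictionary such as the KBBI.
    max_len: longest compound, in words, that will be tried.
    """
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shrink toward a single word.
        for size in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + size])
            if size == 1 or candidate in compound_words:
                tokens.append(candidate)
                i += size
                break
    return tokens

# "rumah sakit" (hospital) and "kereta api" (train) are Indonesian compounds.
compounds = {"rumah sakit", "kereta api"}
tokens = tokenize_plus("Pasien dibawa ke rumah sakit dengan kereta api", compounds)
print(tokens)
```

Keeping a compound as one term matters for TF-IDF, because "rumah sakit" then gets its own weight instead of being counted as the unrelated words "rumah" (house) and "sakit" (sick).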

Klasifikasi Artikel Ilmiah Dengan Berbagai Skenario Preprocessing

Sains, Aplikasi, Komputasi dan Teknologi Informasi ◽

10.30872/jsakti.v2i2.2681 ◽

2020 ◽

Vol 2 (2) ◽

pp. 70

Author(s):

Hidayatul Ma'rifah ◽

Aji Prasetya Wibawa ◽

Muhammad Iqbal Akbar

Keyword(s):

Text Mining ◽

Vector Space ◽

Cross Validation ◽

Confusion Matrix ◽

Vector Space Model ◽

Nearest Neighbour ◽

Inverse Document Frequency ◽

Space Model ◽

Document Frequency ◽

Fold Cross Validation

This study aims to find the combination and order of text mining preprocessing steps that works best for classifying the field of Indonesian-language journal articles from their titles and abstracts. The preprocessing stages are case folding, stemming, stopword removal, transformation to a VSM (Vector Space Model), and SMOTE. The scenarios, however, focus on stemming and two stopword removal techniques: dictionary-based removal, and removal based on document frequency after transformation into VSM form with TF-IDF (Term Frequency-Inverse Document Frequency) weighting. Classification uses the k-NN (K-Nearest Neighbour) algorithm, which determines the class of a test item by looking at its nearest neighbours. In this study, the metric for finding nearest neighbours is cosine similarity. Classification is tested with 10-fold cross validation to produce a confusion matrix as the final result. The best classification performance reached an accuracy of 72.91% and a precision of 73.36%.
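A k-NN classifier that uses cosine similarity to pick neighbours can be sketched in a few lines. The vectors below are toy stand-ins for TF-IDF rows of a vector space model, and the labels, the value of k, and the majority-vote rule are illustrative assumptions rather than the study's settings.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query by majority vote among its k most similar training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(pair[0], query),
        reverse=True,  # higher similarity means a nearer neighbour
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set with two hypothetical classes.
train_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]
train_labels = ["A", "A", "B", "B"]

predicted = knn_predict(train_vectors, train_labels, [0.1, 1.0, 0.0], k=3)
print(predicted)
```

Because cosine similarity grows as vectors become more alike, neighbours are sorted in descending order, the opposite of what a distance metric would need.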

Pengelompokan Artikel Berita Berbahasa Indonesia dengan Agglomerative Clustering

10.31227/osf.io/e95qc ◽

2018 ◽

Author(s):

Yudi Wibisono ◽

Masayu Leylia Khodra

Keyword(s):

Cosine Similarity ◽

Agglomerative Clustering ◽

Single Linkage ◽

Inverse Document Frequency ◽

Complete Linkage ◽

Term Frequency ◽

Average Linkage ◽

Document Frequency ◽

Average Group

This paper applies agglomerative clustering to group Indonesian-language news articles for a news aggregator system. Agglomerative clustering is a hierarchical clustering technique whose advantages are that the number of clusters need not be specified and that cluster quality does not depend on the initial assignment of cluster members. Four linkages are implemented: single linkage, complete linkage, average linkage, and average-group linkage. Clustering uses lexical features, term-frequency inverse document-frequency (tf.idf) weighting, cosine similarity, and a minimum of three members per cluster. On 104 labeled Indonesian-language articles, the best cluster quality came from agglomerative clustering with complete linkage and a minimum similarity of 0.3 (average purity 0.888 with five clusters) and 0.4 (average purity 0.938 with four clusters). The experiments also show that complete linkage gives the best and most consistent average purity among the linkage types, and that purity rises as the min_sim parameter increases, although this reduces the number of clusters produced.
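Complete-linkage agglomerative clustering with a similarity threshold, along with the purity measure, can be sketched as follows. The similarity matrix and labels are invented, the function names are hypothetical, and the minimum-cluster-size filter mentioned in the abstract is left out to keep the sketch short.

```python
from collections import Counter

def complete_linkage_clusters(sim, min_sim):
    """Merge clusters until no pair reaches min_sim under complete linkage.

    sim: symmetric matrix of pairwise similarities (e.g. cosine on tf.idf).
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: a pair of clusters is only as similar
                # as its least similar pair of members.
                link = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link > best[0]:
                    best = (link, a, b)
        if best[0] < min_sim:
            break  # no remaining pair is similar enough to merge
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def purity(cluster, labels):
    """Share of a cluster's members that carry its most common label."""
    counts = Counter(labels[i] for i in cluster)
    return counts.most_common(1)[0][1] / len(cluster)

# Toy similarity matrix for four articles.
sim = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
labels = ["sports", "sports", "politics", "sports"]

clusters = complete_linkage_clusters(sim, min_sim=0.4)
print(clusters, [purity(c, labels) for c in clusters])
```

Raising `min_sim` stops merging earlier, so the clusters that do form tend to be tighter, which matches the direction of the purity trend the abstract reports.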
