PENERAPAN METODE TERM FREQUENCY INVERSE DOCUMENT FREQUENCY (TF-IDF) DAN COSINE SIMILARITY PADA SISTEM TEMU KEMBALI INFORMASI UNTUK MENGETAHUI SYARAH HADITS BERBASIS WEB (STUDI KASUS: HADITS SHAHIH BUKHARI-MUSLIM)

2018 ◽  
Vol 11 (2) ◽  
pp. 149-164 ◽  
Author(s):  
Victor Amrizal

ABSTRACT Hadith is a source of Islamic teaching alongside the Qur'an. Without hadith, Islamic law (syari'at) cannot be fully understood or put into practice. Today, however, many people misunderstand hadith because they rely only on the literal (lahiriyah) text. One way to learn the meaning a hadith carries is to study its syarah (commentary), which reduces the risk of misinterpretation. Existing syarah hadith applications are limited: most are entirely in Arabic, which not everyone can read, while for Indonesian there are only Lidwa and Arbain, whose scope is still very broad. An Information Retrieval System addresses this problem because it offers similarity methods for finding the documents relevant to a query. The similarity method used here is cosine similarity, with terms weighted by TF-IDF. Text preprocessing is applied first to reduce the number of terms and so speed up the term calculations; it consists of tokenizing, stopword removal (filtering), and stemming. Testing with a confusion matrix gave a recall of 88.7%, precision of 100%, accuracy of 88.73%, and an error rate of 11.27%.
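The retrieval pipeline described in this abstract (preprocessing, TF-IDF weighting, cosine-similarity ranking) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the stopword list, the toy documents, the function names, and the `log(N / df)` form of IDF are all assumptions, and stemming is omitted.

```python
import math
from collections import Counter

# Hypothetical stopword list and toy documents, for illustration only.
STOPWORDS = {"the", "of", "a", "is", "and", "to", "in"}
DOCS = [
    "intention and sincerity in deeds",
    "rules of fasting in ramadan",
    "prayer times and ablution",
]


def preprocess(text):
    """Lowercase, tokenize on whitespace, and drop stopwords (no stemming)."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


def build_idf(tokenized_docs):
    """idf(t) = log(N / df(t)); one common variant, assumed here."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf(tokens, idf):
    """Raw term frequency times idf; terms unseen in the corpus get weight 0."""
    tf = Counter(tokens)
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc_index) pairs sorted from most to least similar."""
    tokenized = [preprocess(doc) for doc in docs]
    idf = build_idf(tokenized)
    query_vec = tfidf(preprocess(query), idf)
    scores = [(cosine(query_vec, tfidf(toks, idf)), i)
              for i, toks in enumerate(tokenized)]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    for score, index in rank("fasting ramadan", DOCS):
        print(f"{score:.4f}  {DOCS[index]}")
```

Dropping stopwords before weighting shrinks the vocabulary, which is the speed-up the abstract attributes to preprocessing.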

Author(s):  
Sintia Sintia ◽  
Sarjon Defit ◽  
Gunadi Widi Nurcahyo

In the SiPaGa application, the search for goods codes is still inaccurate, so OPD often choose the wrong goods code. Cosine Similarity and TF-IDF are therefore applied to improve search accuracy. Cosine Similarity calculates similarity using keywords from the goods codes. Term Frequency-Inverse Document Frequency (TF-IDF) is a way of assigning a weight to a word (term). The aim of this research is to improve the accuracy of the search for goods codification. The study processed 14,417 goods codification records taken from the database of the Goods and Price Planning Information System (SiPaGa) application. The search keywords were weighted with TF-IDF and compared using Cosine Similarity to measure how similar they are. The research produces the cosine similarity and TF-IDF weighting calculations, which are expected to be applied to SiPaGa so that its search becomes more accurate than before and OPD can choose the goods code they intend.


Author(s):  
Muhammad Andi Al-rizki ◽  
Galih Wasis Wicaksono ◽  
Yufis Azhar

In education, recognizing the relationship between one subject and another is essential. Knowing how courses relate makes it easy to map the continuity between subjects and to detect and reduce content duplicated across them. This benefits lecturers, students, and departments. It eases analysis and discussion among lecturers who teach subjects in the same domain, students can more easily choose a group of subjects they are interested in, and departments can create specialization groups based on subject similarity and combine courses that are highly similar. In this research, given a good database, the relationship between subjects was calculated from the proximity of their primary contents. The feature used was the term, whose value was determined by calculating TF-IDF (Term Frequency-Inverse Document Frequency) for each term. Cosine similarity was used to measure the proximity between subjects. Finally, the system was tested using precision, recall, and accuracy. The results show precision and accuracy of 90.91% and recall of 100%.
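For reference, precision, recall, and accuracy are all derived from the four confusion-matrix counts. The sketch below uses hypothetical counts that are not the study's data; they are chosen so that precision and accuracy coincide, which happens whenever there are no false negatives and no true negatives. The function name and dictionary keys are illustrative.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute standard metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    error_rate = (fp + fn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "error_rate": error_rate,
    }


# Hypothetical counts, for illustration only.
EXAMPLE = confusion_metrics(tp=10, fp=1, fn=0, tn=0)

if __name__ == "__main__":
    for name, value in EXAMPLE.items():
        print(f"{name}: {value:.2%}")
```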


Author(s):  
Harni Kusniyati ◽  
Arie Aditya Nugraha

Consumers today can buy products from thousands of e-commerce sites. However, the completeness of product specifications and the taxonomies used to organize products differ from one online shop to another. To improve the consumer experience, Pricebook integrates products across websites so that users can find the cheapest price among various platforms. In this paper, we approach the problem with text representation models, namely TF-IDF (term frequency-inverse document frequency) and Word2vec, combined with cosine similarity. TF-IDF weights the relationship between a word (term) and a document. A semantic vector, or word embedding, represents the structure of a sentence by converting it into vector form with Word2vec. Cosine similarity calculates the similarity between two objects expressed as vectors, using the keywords of a document as the measure, which leads to better product matching and categorization. In addition, we compare the TF-IDF and Word2vec representations on a number of data sets.


Author(s):  
Silmi Fauziati ◽  
Adhistya Erna Permanasari ◽  
Indriana Hidayah ◽  
Eko Wahyu Nugroho ◽  
Bobby Rian Dewangga

This paper aims to improve the performance of a scoring system for short-essay tests. The improvement is made by adding simple linear regression to the combined output of the cosine similarity method (with word-frequency weighting based on Term Frequency-Inverse Document Frequency (TF-IDF)) and a word-matching mechanism. The linear regression uses the short-essay score (the result of cosine similarity and word matching) as the regressor variable. To determine the effectiveness of the proposed scoring system, its performance is measured relative to the manual scores given by lecturers. Before linear regression is applied, the scoring system tends to produce higher scores (the scores are biased) than the lecturers' manual scores. Linear regression improves the scoring system by reducing this bias significantly, so that the scores it gives tend to be neither higher nor lower than the lecturers' manual scores. That scoring bias can be reduced significantly with a method as simple as linear regression is expected to help accelerate the adoption of automatic scoring for essay tests in online learning technologies such as e-learning.
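The bias-correction step can be illustrated with ordinary least squares on a single regressor. This is a sketch under assumptions: the scores below are invented (they are not the paper's data), the names are hypothetical, and the fit is evaluated on the same points it was trained on, where the mean residual is zero by construction. A real evaluation would use held-out answers.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x for one regressor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_bias(predicted, reference):
    """Average signed difference; positive means the system over-scores."""
    return sum(p - r for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical scores: the automatic system rates higher than the lecturer.
SYSTEM_SCORES = [60.0, 70.0, 80.0, 90.0, 100.0]
MANUAL_SCORES = [50.0, 58.0, 70.0, 78.0, 90.0]

INTERCEPT, SLOPE = fit_simple_linear_regression(SYSTEM_SCORES, MANUAL_SCORES)
CORRECTED = [INTERCEPT + SLOPE * s for s in SYSTEM_SCORES]

if __name__ == "__main__":
    print("bias before:", mean_bias(SYSTEM_SCORES, MANUAL_SCORES))
    print("bias after: ", mean_bias(CORRECTED, MANUAL_SCORES))
```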


2021 ◽  
Vol 6 (3) ◽  
pp. 236-251
Author(s):  
Novira Azpiranda ◽  
Ahmad Afif Supianto ◽  
Nanang Yudi Setiawan ◽  
Endang Suryawati ◽  
R. Sandra Yuwana ◽  
...  

Al-Ghiff Steak is a restaurant in Cirebon City that offers quality steaks at affordable prices. To maintain its competitive advantage and reputation, Al-Ghiff Steak needs to build a good relationship with customers and follow a business strategy that takes customer opinions into account. In practice, however, Al-Ghiff Steak has difficulty collecting and processing customer review data manually. It is therefore necessary to conduct sentiment analysis on Google Reviews to determine how customers view Al-Ghiff Steak's products and services. The analysis covered 968 Google reviews from 2016 to 2020 and used the Support Vector Machine (SVM) and Term Frequency-Inverse Document Frequency (TF-IDF) methods. Classification was tested with a confusion matrix on four measures: accuracy, precision, recall, and f1-score. SVM with TF-IDF achieved an accuracy of 83%, precision of 64%, recall of 60%, and f1-score of 59%. The sentiment classification results are then visualized in a dashboard. Usability testing with the System Usability Scale (SUS) produced a score of 77.5, which falls in the Acceptable category with an Excellent rating.


2019 ◽  
Vol 8 (4) ◽  
pp. 2594-2602

Automated sentiment analysis of audience feedback is increasingly needed. Reading through all the feedback on a movie manually is tedious, so this study attempts to predict the polarity of a movie from its reviews using machine learning models. The IMDB movie reviews dataset is used for training and testing. The study also addresses the real-life problems of class imbalance and train-test splits and proposes solutions for them. Class imbalance affects many predictive applications, such as cancer detection and the detection of fraudulent bank transactions, so this study attempts a solution to it. Undersampling is used to improve accuracy on the imbalanced classes. Bag of Words and Term Frequency-Inverse Document Frequency are used to extract features from the reviews, and Logistic Regression and SVM classifiers are used to measure accuracy. Alongside accuracy, the confusion matrix is computed to show the effect of class imbalance on accuracy.
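The abstract does not say which undersampling variant was used. As one possibility, random undersampling trims every class down to the size of the smallest class. The sketch below assumes that variant; the function name, seed, and toy reviews are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly keep as many items per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    n_min = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # rng.sample draws n_min distinct items without replacement.
        balanced.extend((item, label) for item in rng.sample(items, n_min))
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced toy data: 8 positive reviews, 2 negative reviews.
REVIEWS = [f"review {i}" for i in range(10)]
LABELS = ["pos"] * 8 + ["neg"] * 2
BALANCED = random_undersample(REVIEWS, LABELS)

if __name__ == "__main__":
    print(Counter(label for _, label in BALANCED))
```

Undersampling discards majority-class data, so it trades some information for balance. It would normally be applied to the training split only, so that the test split keeps the real class distribution.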


Information ◽  
2021 ◽  
Vol 12 (11) ◽  
pp. 486
Author(s):  
Xiaoyan Zhang ◽  
Qiang Yan ◽  
Simin Zhou ◽  
Linye Ma ◽  
Siran Wang

The number of consumers playing virtual reality games is booming. To speed up product iteration, the user experience team needs to collect and analyze unsatisfying experiences in time. In this paper, we aim to detect the unsatisfying experiences hidden in online reviews of virtual reality exergames using a deep learning method and find out the unmet psychological needs of users based on self-determination theory. Convolutional neural networks for sentence classification (textCNN) are used in this study to classify online reviews with unsatisfying experiences. For comparison, we set eXtreme gradient boosting (XGBoost) with lexical features as the baseline of machine learning. Term frequency-inverse document frequency (TF-IDF) is used to extract keywords from every set of classified reviews. The micro-F1 score of textCNN classifier is 90.00, which is better than 82.69 of XGBoost. The top 10 keywords of every set of reviews reflect relevant topics of unmet psychological needs. This paper explores the potential problems causing unsatisfying experiences and unmet psychological needs in virtual reality exergames through text mining and makes a supplement for experimental studies about virtual reality exergames.


2017 ◽  
Vol 2 (1) ◽  
pp. 24
Author(s):  
Paratisa Kharismadita ◽  
Faisal Rahutomo

One of the graduation requirements for undergraduate, master's, and doctoral students is to publish a scientific work. To graduate, an undergraduate must produce a paper published in a scientific journal. However, cases of plagiarism of journal papers are widespread in Indonesia, not only among undergraduate students but also, in some cases, in master's and doctoral programs at several educational institutions. A system for detecting similarity between journal papers is therefore much needed to reduce plagiarism in education. The stage the system must go through is Tokenizing Plus (building a word library based on the KBBI). Tokenizing Plus is the process of obtaining the base words and compound words found in the KBBI. The methods used to obtain the similarity score are Term Frequency-Inverse Document Frequency (TF-IDF) and Cosine Similarity. The system compares the full content of the papers, including the abstract, title, and body.


2020 ◽  
Vol 2 (2) ◽  
pp. 70
Author(s):  
Hidayatul Ma'rifah ◽  
Aji Prasetya Wibawa ◽  
Muhammad Iqbal Akbar

This study aims to find the best-performing combination and order of preprocessing steps in text mining for classifying the field of Indonesian-language journal articles based on their titles and abstracts. The preprocessing stages applied are case folding, stemming, stopwords removal, VSM (Vector Space Model) transformation, and SMOTE. However, each scenario focuses on stemming and two stopwords removal techniques: dictionary-based stopwords removal, and document-frequency-based stopwords removal after transformation into VSM form with TF-IDF (Term Frequency-Inverse Document Frequency) weighting. Classification adopts the k-NN (K-Nearest Neighbour) algorithm, which determines the class of a test item by looking at its nearest neighbours. In this study, the metric used to find the nearest neighbours is Cosine Similarity. Classification is tested with 10-Fold Cross Validation to produce a confusion matrix as the final result. The best classification performance reaches an accuracy of 72.91% and a precision of 73.36%.
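The classification step, k-NN with cosine similarity as the neighbour metric, can be sketched as below. The vectors stand in for TF-IDF weights and are invented for illustration, as are the class names and the choice of k = 3. Preprocessing, SMOTE, and cross-validation are not shown.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(query, training_set, k=3):
    """Majority vote among the k training items most similar to the query."""
    ranked = sorted(training_set,
                    key=lambda item: cosine_similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical weights over the vocabulary
# [network, protocol, routing, rice, harvest, soil].
TRAINING_SET = [
    ([0.9, 0.0, 0.4, 0.0, 0.0, 0.0], "computing"),
    ([0.5, 0.7, 0.0, 0.0, 0.0, 0.0], "computing"),
    ([0.0, 0.0, 0.0, 0.8, 0.6, 0.0], "agriculture"),
    ([0.0, 0.0, 0.0, 0.0, 0.5, 0.7], "agriculture"),
]

if __name__ == "__main__":
    print(knn_predict([1.0, 0.3, 0.0, 0.0, 0.0, 0.0], TRAINING_SET))
```

Because cosine similarity is a similarity rather than a distance, the nearest neighbours are the items with the highest scores, hence the descending sort.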


2018 ◽  
Author(s):  
Yudi Wibisono ◽  
Masayu Leylia Khodra

This paper applies agglomerative clustering to group Indonesian-language news articles for a news aggregator system. Agglomerative clustering is a hierarchical clustering technique whose advantages are that the number of clusters need not be specified and that cluster quality does not depend on the initial assignment of cluster members. Four linkages are implemented: single linkage, complete linkage, average linkage, and average-group linkage. Clustering uses lexical features, term-frequency inverse document-frequency (tf.idf) weighting, and cosine similarity, with a minimum of three members per cluster. On 104 labeled Indonesian-language articles, the best cluster quality is produced by agglomerative clustering with complete linkage and a minimum similarity of 0.3 (average purity 0.888 and five clusters) and 0.4 (average purity 0.938 and four clusters). The experiments also show that complete linkage gives the best and most consistent average purity compared with the other linkage types, and that purity increases as the min_sim parameter is raised, although this reduces the number of clusters produced.
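A rough sketch of complete-linkage agglomerative clustering with a minimum-similarity stopping rule, followed by a purity calculation, is given below. It rests on several assumptions: the similarity matrix and labels are invented, the minimum-cluster-size filter mentioned in the abstract is not implemented, and purity is computed in its standard overall form, which may differ from the paper's "average purity".

```python
from collections import Counter


def complete_linkage(cluster_a, cluster_b, sim):
    """Complete linkage: similarity of the least similar cross-cluster pair."""
    return min(sim[i][j] for i in cluster_a for j in cluster_b)


def agglomerate(sim, min_sim):
    """Merge the most similar pair of clusters until none reaches min_sim."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best_score, best_pair = float("-inf"), None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = complete_linkage(clusters[x], clusters[y], sim)
                if score > best_score:
                    best_score, best_pair = score, (x, y)
        if best_score < min_sim:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    majority_total = sum(
        Counter(labels[i] for i in cluster).most_common(1)[0][1]
        for cluster in clusters
    )
    return majority_total / sum(len(cluster) for cluster in clusters)


# Hypothetical pairwise cosine similarities for five articles.
SIM = [
    [1.0, 0.8, 0.7, 0.1, 0.2],
    [0.8, 1.0, 0.6, 0.2, 0.1],
    [0.7, 0.6, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 1.0],
]
LABELS = ["sports", "sports", "politics", "economy", "economy"]

if __name__ == "__main__":
    result = agglomerate(SIM, min_sim=0.3)
    print(result, purity(result, LABELS))
```

Raising `min_sim` stops merging earlier, so the clusters stay smaller and tend to be purer.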

