THE INFLUENCE OF TEXT PREPROCESSING METHODS AND TOOLS ON CALCULATING TEXT SIMILARITY

Author(s):  
Đorđe Petrović ◽  
Milena Stanković

Text mining depends to a great extent on text preprocessing techniques. The preprocessing methods and tools used to prepare texts for further mining can be divided into those that are language-dependent and those that are not. This research analyzes the influence of these methods and tools on further text mining. We first analyzed their influence on the reduction of the vector space model for the multidimensional representation of text documents. We then analyzed their influence on calculating text similarity, which is the focus of this research. We conclude that applying various text preprocessing methods to the Serbian language in order to reduce the vector space model for the multidimensional representation of text documents achieves the required results. However, applying text preprocessing methods specific to the Serbian language for the purpose of calculating text similarity can lead to great differences in the results.

2018 ◽  
Vol 28 (2) ◽  
pp. 143
Author(s):  
Raghad M. Hadi

The rapid growth of internet technology makes it easy to assemble huge volumes of data as text documents, e.g., journals, blogs, web pages, articles, and emails. In text mining applications, the growing text space of datasets makes it hard to pre-process documents efficiently in preparation for tasks such as document clustering. The proposed system focuses on document pre-processing and on a document-space reduction technique to prepare documents for clustering. The common method for text mining problems is the vector space model (VSM), in which each term represents a feature. The proposed system therefore creates the vector space model by using pre-processing methods to remove trivial data from the dataset, while the huge dimensionality of the VSM is resolved by using low-rank SVD. Experimental results show that the proposed system gives document representation results about 10% better than the previous approach in preparing documents for document clustering.


2014 ◽  
Vol 14 (3) ◽  
pp. 25-36
Author(s):  
Bohdan Pavlyshenko

This paper describes the analysis of possible differentiation of the author's idiolect in the space of semantic fields; it also analyzes the clustering of text documents in the vector space of semantic fields and in the semantic space with an orthogonal basis. The analysis showed that using the vector space model on the basis of semantic fields is efficient in cluster analysis algorithms for authors' texts in English fiction. The study of the distribution of authors' texts in the cluster structure showed the presence of areas of semantic space that represent the idiolects of individual authors. Such areas are described by the clusters where only one author dominates. The clusters where the texts of several authors dominate can be considered areas of semantic similarity of authors' styles. SVD factorization of the semantic fields matrix makes it possible to significantly reduce the dimension of the semantic space in the cluster analysis of authors' texts. Clustering in the semantic field vector space can be efficient in a comparative analysis of authors' styles and idiolects. The clusters of some authors' idiolects are semantically invariant and do not depend on changes in the basis of the semantic space or in the clustering method.


2013 ◽  
Vol 660 ◽  
pp. 202-206
Author(s):  
Cai Rui ◽  
Li Fei ◽  
Chen Bin ◽  
Quan Cong

Because the traditional vector space model for text similarity calculation does not take word order into consideration, which leads to bias, this paper proposes a text similarity calculation that combines the longest common subsequence with the traditional vector space model. The method takes both word order and word frequency into account: it uses the longest common subsequence of the texts, together with their common substrings, to capture word-order information, and it uses the term weights of the traditional vector space model to capture word-frequency information. A dataset collected with a web crawler was used to test the proposed text similarity calculation method, and the results demonstrate the effectiveness of the method.
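The idea of combining a word-order signal with a word-frequency signal can be sketched as follows. This is a minimal illustration, not the authors' formula: the abstract does not state how the two scores are combined, so the linear mix and the weight `alpha`, the whitespace tokenizer, and the use of raw term counts (rather than the paper's term weights) are all assumptions, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def cosine_tf(a, b):
    """Cosine similarity of two token lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(count * cb[term] for term, count in ca.items())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def combined_similarity(text_a, text_b, alpha=0.5):
    """Mix a word-order score (LCS) with a word-frequency score (cosine).

    alpha is a hypothetical mixing weight; the source does not specify one.
    """
    a, b = text_a.lower().split(), text_b.lower().split()
    if not a or not b:
        return 0.0
    order_sim = lcs_length(a, b) / max(len(a), len(b))
    freq_sim = cosine_tf(a, b)
    return alpha * order_sim + (1 - alpha) * freq_sim
```

With this sketch, "the dog bit the man" and "the man bit the dog" have identical term counts, so the frequency score alone cannot tell them apart, while the LCS term lowers the combined score because the word order differs.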


2018 ◽  
Vol 7 (2.3) ◽  
pp. 73 ◽  
Author(s):  
Robbi Rahim ◽  
Nuning Kurniasih ◽  
Muhammad Dedi Irawan ◽  
Yustria Handika Siregar ◽  
Abdurrozzaq Hasibuan ◽  
...  

A document is a written record that can serve as evidence of information. Plagiarism is the deliberate or unintentional act of obtaining or attempting to obtain credit or value for a scientific work by citing some or all of another party's scientific work as one's own without stating the source properly and adequately. The Latent Semantic Indexing method serves to find text in a document that matches other text. The algorithm used is TF/IDF, which weights a term in a document by multiplying its TF value by its IDF value, while the Vector Space Model (VSM) is a method for measuring the degree of closeness or similarity by means of term weighting.
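The TF/IDF weighting and VSM comparison described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the abstract does not give the exact formulas, so raw counts for TF, `log(N / df)` for IDF (one common variant), and the whitespace tokenizer are assumptions, and the function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: tf[term] * log(n_docs / df[term]) for term in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "plagiarism detection in text",
    "text similarity and plagiarism detection",
    "weather report for monday",
]
vectors = tf_idf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # positive: the documents share terms
print(cosine(vectors[0], vectors[2]))  # 0.0: no terms in common
```

Note that with this IDF variant a term occurring in every document gets weight zero, so it contributes nothing to the similarity.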


METIK JURNAL ◽  
2021 ◽  
Vol 5 (1) ◽  
pp. 36-41
Author(s):  
Pramudya Insan ◽  
Kusrini

Algorithms used in text-based classification, or text mining, are rarely compared with one another, especially for emotion classification. Many studies perform classification without any comparison and without using an independently built system. In this study, a comparison is made to measure the accuracy that the ID3 and KNN algorithms achieve in the classification process. The data consist of 220 news texts taken from the online news site viva.co.id. Training used a different weighting process for each algorithm: tf-idf term weighting for ID3, and similarity with the vector space model for KNN. The classification was performed to obtain emotion-categorized data; across the various ratios of training to test data that were evaluated, the highest accuracy obtained was 71.25, with a 75%-25% split between training and test data. The study concludes that the ID3 algorithm performs better in classifying emotion in Indonesian-language text, describing it as a very efficient method for grouping data by category, whether manually or by system.
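KNN classification over a vector space model, the second of the two approaches compared in this abstract, can be sketched as below. This is a minimal sketch, not the study's system: the training sentences and emotion labels are invented for illustration, raw term counts stand in for whatever weighting the authors used, and `k=3` and the function names are assumptions.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two Counter term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def knn_classify(text, training, k=3):
    """Label a text by majority vote among its k most similar training texts."""
    query = Counter(text.lower().split())
    scored = sorted(
        ((cosine(query, Counter(doc.lower().split())), label) for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data, invented for this example.
training = [
    ("fans cheer happy victory celebration", "joy"),
    ("happy crowd celebrates win", "joy"),
    ("family mourns tragic loss", "sadness"),
    ("victims grieve after tragic flood", "sadness"),
    ("residents furious over price hike", "anger"),
]

print(knn_classify("happy fans celebrate victory", training))
print(knn_classify("tragic flood victims grieve", training))
```

No stemming is applied here, so "celebrate" and "celebrates" count as different terms; a real pipeline for Indonesian news text would normally add language-specific preprocessing before the similarity step.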


Respati ◽  
2019 ◽  
Vol 14 (1) ◽  
Author(s):  
Ferdy Febriyanto

The development of e-learning systems increases every year because e-learning makes learning considerably more convenient. Educational institutions, especially public and private universities, have begun to adopt e-learning systems in their teaching. In e-learning, an examination can be carried out entirely within the system, from answering the questions through to assessment; so far, however, most essay exams have been graded manually by reading the essays one by one. Lecturers need to spend a lot of time assessing students' exam answers, and the more exams there are to grade, the lower the quality of the assessment becomes. This problem can be addressed by building an application that computes text similarity. In this thesis research, the author therefore uses the TF/IDF (Term Frequency - Inverse Document Frequency) algorithm and the VSM (Vector Space Model) to compute the similarity between an answer text and the answer-key text. This similarity value can serve as a reference for grading students' exam answers. The study uses data from the final semester examination at STMIK Indonesia Banjarmasin for 10 courses: Graphic Design, Computer Networks, Introduction to Information Technology, Interpersonal Skills, Operating Systems, Introduction to Management, Professional Ethics, Database Systems, Microprocessors, and Web Programming. For each course, 30 questions were entered, each with 3 different correct answers as references for the similarity comparison. During processing, the system removes words considered unimportant or too commonly used, as well as characters and symbols, because it only processes questions that require theoretical and argumentative answers, not mathematical ones. In this study, words in the local Banjar language are also removed by the system so that answers are standardized on Indonesian. On the words that remain after removal, the TF/IDF algorithm computes the term weights and the VSM computes the cosine value, which yields the degree of similarity between the students' answers and the lecturers' answers. The resulting correlation is fairly good, with an average accuracy of 80%-90% compared with manual grading by humans.
Keywords: Automatic Exam Assessment, TF/IDF, VSM, Similarity.


2021 ◽  
Vol 5 (1) ◽  
pp. 63-68
Author(s):  
Amalia Beladinna Arifa ◽  
Gita Fadila Fitriana ◽  
Ananda Rifkiy Hasan

One way to find out the quality of exam questions is by looking at the rules for writing exam questions made based on the subject or discussion contained in the learning plan document. Therefore, the exam questions that are arranged must be adjusted to the main material in each subject learning achievement. This study discusses the implementation of the concept in information retrieval systems using the Vector Space Model method. The Vector Space Model method has an advantage in query matching because it is able to match only part of the query with existing documents. In addition, the Vector Space Model method is also easy to adapt by adjusting parameters, including weighting parameters. The weighting calculation for each term that appears in the document uses TF-IDF. The purpose of this study is to design an information retrieval system to find the suitability of the exam question query with the subject contained in the learning plan document. The suitability is sorted based on the similarity value of the calculation results, from the largest value to the smallest value in the form of a percentage.

