scholarly journals Hierarchical text classification using Relative Inverse Document Frequency

Author(s):  
Boonthida Chiraratanasopha ◽  
Thanaruk Theeramunkong ◽  
Salin Boonbrahm

Automatic hierarchical text classification has been a challenging and in-needed task with an increasing of hierarchical taxonomy from the booming of knowledge organization. The hierarchical structure identifies the relationships of dependence between different categories in which can be overlapped of generalized and specific concepts within the tree. This paper presents the use of frequency of the occurring term in related categories among the hierarchical tree to help in document classification. The four extended term weighting of Relative Inverse Document Frequency (IDFr) including its located category, its parent category, its sibling categories and its child categories are exploited to generate a classifier model using centroid-based technique. From the experiment on hierarchical text classification of Thai documents, the IDFr achieved the best accuracy and F-measure as 53.65% and 50.80% in Top-n features set from family-based evaluation in which are higher than TF-IDF for 2.35% and 1.15% in the same settings, respectively.

2021 ◽  
Vol 2021 ◽  
pp. 1-30
Author(s):  
Zhiying Jiang ◽  
Bo Gao ◽  
Yanlin He ◽  
Yongming Han ◽  
Paul Doyle ◽  
...  

With the rapid development of the internet technology, a large amount of internet text data can be obtained. The text classification (TC) technology plays a very important role in processing massive text data, but the accuracy of classification is directly affected by the performance of term weighting in TC. Due to the original design of information retrieval (IR), term frequency-inverse document frequency (TF-IDF) is not effective enough for TC, especially for processing text data with unbalanced distributions in internet media reports. Therefore, the variance between the DF value of a particular term and the average of all DFs DF ¯ , namely, the document frequency variance (ADF), is proposed to enhance the ability in processing text data with unbalanced distribution. Then, the normal TF-IDF is modified by the proposed ADF for processing unbalanced text collection in four different ways, namely, TF-IADF, TF-IADF+, TF-IADFnorm, and TF-IADF+norm. As a result, an effective model can be established for the TC task of internet media reports. A series of simulations have been carried out to evaluate the performance of the proposed methods. Compared with TF-IDF on state-of-the-art classification algorithms, the effectiveness and feasibility of the proposed methods are confirmed by simulation results.


Rekayasa ◽  
2020 ◽  
Vol 13 (2) ◽  
pp. 172-180
Author(s):  
Ana Tsalitsatun Ni'mah ◽  
Agus Zainal Arifin

Hadis adalah sumber rujukan agama Islam kedua setelah Al-Qur’an. Teks Hadis saat ini diteliti dalam bidang teknologi untuk dapat ditangkap nilai-nilai yang terkandung di dalamnya secara pegetahuan teknologi. Dengan adanya penelitian terhadap Kitab Hadis, pengambilan informasi dari Hadis tentunya membutuhkan representasi teks ke dalam vektor untuk mengoptimalkan klasifikasi otomatis. Klasifikasi Hadis diperlukan untuk dapat mengelompokkan isi Hadis menjadi beberapa kategori. Ada beberapa kategori dalam Kitab Hadis tertentu yang sama dengan Kitab Hadis lainnya. Ini menunjukkan bahwa ada beberapa dokumen Kitab Hadis tertentu yang memiliki topik yang sama dengan Kitab Hadis lain. Oleh karena itu, diperlukan metode term weighting yang dapat memilih kata mana yang harus memiliki bobot tinggi atau rendah dalam ruang Kitab Hadis untuk optimalisasi hasil klasifikasi dalam Kitab-kitab Hadis. Penelitian ini mengusulkan sebuah perbandingan beberapa metode term weighting, yaitu: Term Frequency Inverse Document Frequency (TF-IDF), Term Frequency Inverse Document Frequency Inverse Class Frequency (TF-IDF-ICF), Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency (TF-IDF-ICSδF), dan Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency Inverse Hadith Space Density Frequency (TF-IDF-ICSδF-IHSδF). Penelitian ini melakukan perbandingan hasil term weighting terhadap dataset Terjemahan 9 Kitab Hadis yang diterapkan pada mesin klasifikasi Naive Bayes dan SVM. 9 Kitab Hadis yang digunakan, yaitu: Sahih Bukhari, Sahih Muslim, Abu Dawud, at-Turmudzi, an-Nasa'i, Ibnu Majah, Ahmad, Malik, dan Darimi. Hasil uji coba menunjukkan bahwa hasil klasifikasi menggunakan metode term weighting TF-IDF-ICSδF-IHSδF mengungguli term weighting lainnya, yaitu mendapatkan Precission sebesar 90%, Recall sebesar 93%, F1-Score sebesar 92%, dan Accuracy sebesar 83%.Comparison of a term weighting method for the text classification in Indonesian hadithHadith is the second source of reference for Islam after the Qur’an. Currently, hadith text is researched in the field of technology for capturing the values of technology knowledge. With the research of the Book of Hadith, retrieval of information from the hadith certainly requires the representation of text into vectors to optimize automatic classification. The classification of the hadith is needed to be able to group the contents of the hadith into several categories. There are several categories in certain Hadiths that are the same as other Hadiths. Shows that there are certain documents of the hadith that have the same topic as other Hadiths. Therefore, a term weighting method is needed that can choose which words should have high or low weights in the Hadith Book space to optimize the classification results in the Hadith Books. This study proposes a comparison of several term weighting methods, namely: Term Frequency Inverse Document Frequency (TF-IDF), Term Frequency Inverse Document Frequency Inverse Class Frequency (TF-IDF-ICF), Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency (TF-IDF-ICSδF) and Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency Inverse Hadith Space Density Frequency (TF-IDF-ICSδF-IHSδF). This research compares the term weighting results to the 9 Hadith Book Translation dataset applied to the Naive Bayes classification engine and SVM. 9 Books of Hadith are used, namely: Sahih Bukhari, Sahih Muslim, Abu Dawud, at-Turmudzi, an-Nasa’i, Ibn Majah, Ahmad, Malik, and Darimi. The trial results show that the classification results using the TF-IDF-ICSδF-IHSδF term weighting method outperformed another term weighting, namely getting a Precession of 90%, Recall of 93%, F1-Score of 92%, and Accuracy of 83%.


2019 ◽  
Vol 9 (2) ◽  
Author(s):  
Nur Rafeeqkha Sulaiman ◽  
Maheyzah Md. Siraj

Due to the growth of Internet, it has not only become the medium for getting information, it has also become a platform for communicating. Social Network Service (SNS) is one of the main platform where Internet users can communicate by distributing, sharing of information and knowledge. Chatting has become a popular communication medium for Internet users whereby users can communicate directly and privately with each other. However, due to the privacy of chat rooms or chatting mediums, the content of chat logs is not monitored and not filtered. Thus, easing cyber predators preying on their preys. Cyber groomers are one of cyber predators who prey on children or minors to satisfy their sexual desire. Workforce expertise that involve in intelligence gathering always deals with difficulty as the complexity of crime increases, human errors and time constraints. Hence, it is difficult to prevent undesired content, such as grooming conversation, in chat logs. An investigation on two term weighting schemes on two datasets are used to improve the content-based classification techniques. This study aims to improve the content-based classification accuracy on chat logs by comparing two term weighting schemes in classifying grooming contents. Two term weighting schemes namely Term Frequency – Inverse Document Frequency – Inverse Class Space Density Frequency (TF.IDF.ICSdF) and Fuzzy Rough Feature Selection (FRFS) are used as feature selection process in filtering chat logs. The performance of these techniques were examined via datasets, and the accuracy of their result was measured by Support Vector Machine (SVM). TF.IDF.ICSdF and FRFS are judged based on accuracy, precision, recall and F score measurement.


Author(s):  
Syaifulloh Amien Pandega Perdana ◽  
Teguh Bharata Aji ◽  
Ridi Ferdiana

Ulasan pelanggan merupakan opini terhadap kualitas barang atau jasa yang dirasakan konsumen. Ulasan pelanggan mengandung informasi yang berguna bagi konsumen maupun penyedia barang atau jasa. Ketersediaan ulasan pelanggan dalam jumlah besar pada website membutuhkan suatu framework untuk mengekstraksi sentimen secara otomatis. Sebuah ulasan pelanggan sering kali mengandung banyak aspek sehingga Aspect Based Sentiment Analysis (ABSA) harus digunakan untuk mengetahui polaritas masing-masing aspek. Salah satu tugas penting dalam ABSA adalah Aspect Category Detection. Metode machine learning untuk Aspect Category Detection sudah banyak dilakukan pada domain berbahasa Inggris, tetapi pada domain bahasa Indonesia masih sedikit. Makalah ini membandingkan kinerja tiga algoritme machine learning, yaitu Naïve Bayes (NB), Support Vector Machine (SVM), dan Random Forest (RF) pada ulasan pelanggan berbahasa Indonesia menggunakan Term Frequency–Inverse Document Frequency (TF-IDF) sebagai term weighting. Hasil menunjukkan bahwa RF memiliki kinerja paling unggul dibandingkan NB dan SVM pada tiga domain yang berbeda, yaitu restoran, hotel, dan e-commerce, dengan nilai f1-score untuk masing-masing domain adalah 84.3%, 85.7%, dan 89,3%.


Author(s):  
Charan Lokku

Abstract: To avoid fraudulent Job postings on the internet, we target to minimize the number of such frauds through the Machine Learning approach to predict the chances of a job being fake so that the candidate can stay alert and make informed decisions if required. The model will use NLP to analyze the sentiments and pattern in the job posting and TF-IDF vectorizer for feature extraction. In this model, we are going to use Synthetic Minority Oversampling Technique (SMOTE) to balance the data and for classification, we used Random Forest to predict output with high accuracy, even for the large dataset it runs efficiently, and it enhances the accuracy of the model and prevents the overfitting issue. The final model will take in any relevant job posting data and produce a result determining whether the job is real or fake. Keywords: Natural Language Processing (NLP), Term Frequency-Inverse Document Frequency (TF-IDF), Synthetic Minority Oversampling Technique (SMOTE), Random Forest.


Author(s):  
Wahyu Adi Prabowo ◽  
Fitriani Azizah

Social media has become a new method of today’s communication in a new digitalize era. Children and adults have used social media a lot in interacting with others. Therefore social media has shifted conventional communication into digital one. This digital development on social media is a serious problem that must be faced because it has been found that there are more and more acts of cyberbullying. This act of cyberbullying can attack the psychic, causing depression up to suicide. The dangers of cyberbullying are troubling and cause concern to the community. Therefore, this study will analyze the sentiment on the comments contained on social media to find out the value of sentiment from comments on social media platforms. The comment data will be processed at the preprocessing stage, Term Frequency-Inverse Document Frequency (TF-IDF), and the Support Vector Machine (SVM) classification method. Comment data to be classified as 1500 data taken using crawling data through libraries in python programming and divided into 80% data training and 20% data testing. Based on the results of the test, the accuracy value is 93%, the precision value is 95%, and the recall value is 97%. In this research, a system model design is also carried out where the system can be integrated with the browser to open a user page on the classification of comments that have been input into the system.


2019 ◽  
Vol 3 (4) ◽  
pp. 53
Author(s):  
Ahmad Hawalah

Text classification is a process of classifying textual contents to a set of predefined classes and categories. As enormous numbers of documents and contextual contents are introduced every day on the Internet, it becomes essential to use text classification techniques for different purposes such as enhancing search retrieval and recommendation systems. A lot of work has been done to study different aspects of English text classification techniques. However, little attention has been devoted to study Arabic text classification due to the difficulty of processing Arabic language. Consequently, in this paper, we propose an enhanced Arabic topic-discovery architecture (EATA) that can use ontology to provide an effective Arabic topic classification mechanism. We have introduced a semantic enhancement model to improve Arabic text classification and the topic discovery technique by utilizing the rich semantic information in Arabic ontology. We rely in this study on the vector space model (term frequency-inverse document frequency (TF-IDF)) as well as the cosine similarity approach to classify new Arabic textual documents.


PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0254937
Author(s):  
Serhad Sarica ◽  
Jianxi Luo

There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While researchers use readily available stopwords lists that are derived from non-technical resources, the technical jargon of engineering fields contains their own highly frequent and uninformative words and there exists no standard stopwords list for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative statistical measures such as term frequency, inverse document frequency, and entropy, and curating a stopwords dataset ready for technical language processing applications.


2019 ◽  
Vol 8 (4) ◽  
pp. 2594-2602

The need for generating automated sentiment on audience feedbacks has been the need of the hour. Manually going through the entire movie feedback becomes tedious therefore an attempt to predict the polarity of a movie based on the reviews using machine learning models is done. Usage of the IMDB movie reviews dataset has been done for training and testing. In this study we also try to depict the real-life problems of class imbalance and train-test splits, hence obtaining solutions for the same. The problem of class imbalance in today’s world has affected a large amount of predictive applications such as cancer detection , fraudulent transactions in banks etc, hence this study is an attempt to perform a solution to solve the class imbalance problem. Use of the undersampling method has been done in this study to improve the accuracy of an imbalanced class. Feature extraction methods such as Bag of Words and Term Frequency Inverse document Frequency have been used to generate features from the reviews. The Logistic regression and SVM classifiers have been used in the study to measure the accuracy. Along with the accuracy the Confusion Matrix has also been calculated to showcase the class imbalance taking its effect on the accuracy.


Sign in / Sign up

Export Citation Format

Share Document