scholarly journals Phisher Fighter: Website Phishing Detection System Based on URL and Term Frequency-Inverse Document Frequency Values

Author(s):  
E. Sri Vishva ◽  
D. Aju

Fundamentally, phishing is a common cybercrime that is indulged by the intruders or hackers on naive and credible individuals and make them to reveal their unique and sensitive information through fictitious websites. The primary intension of this kind of cybercrime is to gain access to the ad hominem or classified information from the recipients. The obtained data comprises of information that can very well utilized to recognize an individual. The purloined personal or sensitive information is commonly marketed in the online dark market and subsequently these information will be bought by the personal identity brigands. Depending upon the sensitivity and the importance of the stolen information, the price of a single piece of purloined information would vary from few dollars to thousands of dollars. Machine learning (ML) as well as Deep Learning (DL) are powerful methods to analyse and endeavour against these phishing attacks. A machine learning based phishing detection system is proposed to protect the website and users from such attacks. In order to optimize the results in a better way, the TF-IDF (Term Frequency-Inverse Document Frequency) value of webpages is employed within the system. ML methods such as LR (Logistic Regression), RF (Random Forest), SVM (Support Vector Machine), NB (Naive Bayes) and SGD (Stochastic Gradient Descent) are applied for training and testing the obtained dataset. Henceforth, a robust phishing website detection system is developed with 90.68% accuracy.

Author(s):  
Syaifulloh Amien Pandega Perdana ◽  
Teguh Bharata Aji ◽  
Ridi Ferdiana

Ulasan pelanggan merupakan opini terhadap kualitas barang atau jasa yang dirasakan konsumen. Ulasan pelanggan mengandung informasi yang berguna bagi konsumen maupun penyedia barang atau jasa. Ketersediaan ulasan pelanggan dalam jumlah besar pada website membutuhkan suatu framework untuk mengekstraksi sentimen secara otomatis. Sebuah ulasan pelanggan sering kali mengandung banyak aspek sehingga Aspect Based Sentiment Analysis (ABSA) harus digunakan untuk mengetahui polaritas masing-masing aspek. Salah satu tugas penting dalam ABSA adalah Aspect Category Detection. Metode machine learning untuk Aspect Category Detection sudah banyak dilakukan pada domain berbahasa Inggris, tetapi pada domain bahasa Indonesia masih sedikit. Makalah ini membandingkan kinerja tiga algoritme machine learning, yaitu Naïve Bayes (NB), Support Vector Machine (SVM), dan Random Forest (RF) pada ulasan pelanggan berbahasa Indonesia menggunakan Term Frequency–Inverse Document Frequency (TF-IDF) sebagai term weighting. Hasil menunjukkan bahwa RF memiliki kinerja paling unggul dibandingkan NB dan SVM pada tiga domain yang berbeda, yaitu restoran, hotel, dan e-commerce, dengan nilai f1-score untuk masing-masing domain adalah 84.3%, 85.7%, dan 89,3%.


2021 ◽  
Vol 14 (1) ◽  
pp. 49
Author(s):  
Tezza Fazar Tri Hidayat ◽  
Garno Garno ◽  
Azhari Ali Ridha

Relokasi ibu kota Indonesia kini telah diresmikan oleh Presiden Joko Widodo pada 26 Agustus 2019 ke Kalimantan, ini adalah sejarah baru dalam sejarah Indonesia karena belum pernah terjadi sebelumnya, sehingga memunculkan banyak pendapat atau tanggapan dari masyarakat. Analisis sentimen adalah kegiatan yang digunakan untuk menganalisis pendapat atau opini seseorang tentang suatu topik. Twitter adalah media sosial yang digunakan untuk mengekspresikan pendapat pengguna dan menyatukannya pada suatu topik. Support Vector Machine adalah metode text mining yang mencakup metode klasifikasi dan Term Frequency - Inverse Document Frequency adalah metode pembobotan karakter. SVM dan TF-IDF dapat digunakan untuk menganalisis sentimen opini publik tentang topik pemindahan ibukota Indonesia. Tujuan dari penelitian ini adalah untuk mengklasifikasikan opini publik tentang topik memindahkan Ibu Kota Indonesia dari ribuan tweet yang telah dikumpulkan dan disaring. Tweet pada dari 22-29 Maret 2020 telah diproses menjadi 992 tweet dan terdiri dari 221 data dengan label positif dan 771 data negatif. Dan menggunakan metode SVM yang memiliki akurasi 77,72% dan dikombinasikan dengan TFIDF yang meningkatkan akurasinya menjadi 78,33%.


2019 ◽  
Vol 6 (1) ◽  
pp. 138-149
Author(s):  
Ukhti Ikhsani Larasati ◽  
Much Aziz Muslim ◽  
Riza Arifudin ◽  
Alamsyah Alamsyah

Data processing can be done with text mining techniques. To process large text data is required a machine to explore opinions, including positive or negative opinions. Sentiment analysis is a process that applies text mining methods. Sentiment analysis is a process that aims to determine the content of the dataset in the form of text is positive or negative. Support vector machine is one of the classification algorithms that can be used for sentiment analysis. However, support vector machine works less well on the large-sized data. In addition, in the text mining process there are constraints one is number of attributes used. With many attributes it will reduce the performance of the classifier so as to provide a low level of accuracy. The purpose of this research is to increase the support vector machine accuracy with implementation of feature selection and feature weighting. Feature selection will reduce a large number of irrelevant attributes. In this study the feature is selected based on the top value of K = 500. Once selected the relevant attributes are then performed feature weighting to calculate the weight of each attribute selected. The feature selection method used is chi square statistic and feature weighting using Term Frequency Inverse Document Frequency (TFIDF). Result of experiment using Matlab R2017b is integration of support vector machine with chi square statistic and TFIDF that uses 10 fold cross validation gives an increase of accuracy of 11.5% with the following explanation, the accuracy of the support vector machine without applying chi square statistic and TFIDF resulted in an accuracy of 68.7% and the accuracy of the support vector machine by applying chi square statistic and TFIDF resulted in an accuracy of 80.2%.


2021 ◽  
Vol 10 (2) ◽  
pp. 742-749
Author(s):  
Tanvirul Islam ◽  
Ashik Iqbal Prince ◽  
Md. Mehedee Zaman Khan ◽  
Md. Ismail Jabiullah ◽  
Md. Tarek Habib

Bangla blog is increasing rapidly in the era of information, and consequently, the blog has a diverse layout and categorization. In such an aptitude, automated blog post classification is a comparatively more efficient solution in order to organize Bangla blog posts in a standard way so that users can easily find their required articles of interest. In this research, nine supervised learning models which are Support Vector Machine (SVM), multinomial naïve Bayes (MNB), multi-layer perceptron (MLP), k-nearest neighbours (k-NN), stochastic gradient descent (SGD), decision tree, perceptron, ridge classifier and random forest are utilized and compared for classification of Bangla blog post. Moreover, the performance on predicting blog posts against eight categories, three feature extraction techniques are applied, namely unigram TF-IDF (term frequency-inverse document frequency), bigram TF-IDF, and trigram TF-IDF. The majority of the classifiers show above 80% accuracy. Other performance evaluation metrics also show good results while comparing the selected classifiers.


2020 ◽  
pp. 016555152091765 ◽  
Author(s):  
Ibrahim Aljarah ◽  
Maria Habib ◽  
Neveen Hijazi ◽  
Hossam Faris ◽  
Raneem Qaddoura ◽  
...  

Nowadays, cyber hate speech is increasingly growing, which forms a serious problem worldwide by threatening the cohesion of civil societies. Hate speech relates to using expressions or phrases that are violent, offensive or insulting for a person or a minority of people. In particular, in the Arab region, the number of Arab social media users is growing rapidly, which is accompanied with high increasing rate of cyber hate speech. This drew our attention to aspire healthy online environments that are free of hatred and discrimination. Therefore, this article aims to detect cyber hate speech based on Arabic context over Twitter platform, by applying Natural Language Processing (NLP) techniques, and machine learning methods. The article considers a set of tweets related to racism, journalism, sports orientation, terrorism and Islam. Several types of features and emotions are extracted and arranged in 15 different combinations of data. The processed dataset is experimented using Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT) and Random Forest (RF), in which RF with the feature set of Term Frequency-Inverse Document Frequency (TF-IDF) and profile-related features achieves the best results. Furthermore, a feature importance analysis is conducted based on RF classifier in order to quantify the predictive ability of features in regard to the hate class.


Hoax news on social media has had a dramatic effect on our society in recent years. The impact of hoax news felt by many people, anxiety, financial loss, and loss of the right name. Therefore we need a detection system that can help reduce hoax news on social media. Hoax news classification is one of the stages in the construction of a hoax news detection system, and this unsupervised learning algorithm becomes a method for creating hoax news datasets, machine learning tools for data processing, and text processing for detecting data. The next will produce a classification of a hoax or not a Hoax based on the text inputted. Hoax news classification in this study uses five algorithms, namely Support Vector Machine, Naïve Bayes, Decision Tree, Logistic Regression, Stochastic Gradient Descent, and Neural Network (MLP). These five algorithms to produce the best algorithm that can use to detect hoax news, with the highest parameters, accuracy, F-measure, Precision, and recall. From the results of testing conducted on five classification algorithms produced shows that the NN-MPL algorithm has an average of 93% for the value of accuracy, F-Measure, and Precision, the highest compared to five other algorithms, but for the highest Recall value generated from the algorithm SVM which is 94%. the results of this experiment show that different effects for different classifiers, and that means that the more hoax data used as training data, the more accurate the system calculates accuracy in more detail.


Sign in / Sign up

Export Citation Format

Share Document