Fake News Detection from Online media using Machine learning Classifiers

Abstract: With advances in technology, news consumption has shifted from print media to social media. Convenience and accessibility are major factors behind this shift. However, the change has brought about a new challenge: "fake news" spreading online with little supervision. This paper addresses the challenge with machine learning. K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Decision Tree, Naïve Bayes, and Logistic Regression classifiers are used to distinguish fake news from real news in a given dataset, and the efficiency of these algorithms is increased by pre-processing the data to handle class imbalance more appropriately. A comparison of the classifiers is presented along with the results. In the experiment, the proposed model achieved an accuracy of 89.98% for KNN, 90.46% for Logistic Regression, 86.89% for Naïve Bayes, 73.33% for Decision Tree, and 89.33% for SVM.

COMPARATIVE STUDY OF CLASSIFICATION ALGORITHMS: HOLDOUTS AS ACCURACY ESTIMATION

CogITo Smart Journal ◽

10.31154/cogito.v1i1.2.13-23 ◽

2016 ◽

Vol 1 (1) ◽

pp. 13 ◽

Cited By ~ 1

Author(s):

Debby Erce Sondakh

Keyword(s):

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Decision Rules ◽

Naïve Bayes ◽

Support Vector ◽

Classification Algorithms ◽

K Nearest Neighbor ◽

Accuracy Estimation ◽

F Measure

This study aims to measure and compare the performance of five machine-learning-based text classification algorithms, namely decision rules, decision tree, k-nearest neighbor (k-NN), naïve Bayes, and Support Vector Machine (SVM), using multi-class text documents. The comparison addresses the effectiveness of the algorithms, that is, their ability to assign documents to the correct category, using the holdout (percentage split) method. The effectiveness measures used are precision, recall, F-measure, and accuracy. The experimental results show that for naïve Bayes, the larger the percentage of training documents, the higher the accuracy of the resulting model. Naïve Bayes reached its highest accuracy at a 90/10 split, SVM at 80/20, and decision tree at 70/30. The results also show that naïve Bayes had the highest effectiveness among the five algorithms tested and the fastest classification-model build time, 0.02 seconds. Decision tree classified text documents with higher accuracy than SVM, but its model build time was slower. In terms of model build time, k-NN was the fastest, but its accuracy was lower.
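The holdout (percentage split) procedure described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's experimental setup: the function name `holdout_split`, the fixed random seed, and the toy corpus are all assumptions made for the example.

```python
import random


def holdout_split(documents, labels, train_percent, seed=0):
    """Shuffle the data once, then split it into training and test portions.

    train_percent is an integer such as 90 for a 90/10 split.
    """
    if len(documents) != len(labels):
        raise ValueError("documents and labels must have the same length")
    indices = list(range(len(documents)))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    cut = len(indices) * train_percent // 100  # size of the training portion
    train = [(documents[i], labels[i]) for i in indices[:cut]]
    test = [(documents[i], labels[i]) for i in indices[cut:]]
    return train, test


# Hypothetical toy corpus: 20 placeholder documents spread over 4 classes.
docs = [f"doc{i}" for i in range(20)]
labels = [i % 4 for i in range(20)]

for pct in (90, 80, 70):
    train, test = holdout_split(docs, labels, pct)
    print(f"{pct}/{100 - pct} split: {len(train)} train, {len(test)} test")
```

Each classifier would then be fitted on `train` and scored on `test`. Repeating this for several split percentages gives the kind of comparison the abstract reports.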

KOMPARASI METODE KLASIFIKASI PADA ANALISIS SENTIMEN USAHA WARALABA BERDASARKAN DATA TWITTER

Jurnal Pilar Nusa Mandiri ◽

10.33480/pilar.v15i2.752 ◽

2019 ◽

Vol 15 (2) ◽

pp. 267-274

Author(s):

Tati Mardiana ◽

Hafiz Syahreva ◽

Tuslaela Tuslaela

Keyword(s):

Neural Network ◽

Support Vector Machine ◽

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Confusion Matrix ◽

Naïve Bayes ◽

Support Vector ◽

K Nearest Neighbor

Franchise businesses in Indonesia are currently relatively attractive. However, many business owners also fail. Anyone who wants to start a business needs to consider public sentiment toward franchise businesses. Sentiment analysis is nonetheless not easy, because Twitter conversations about franchise businesses are numerous and unstructured. The aim of this study is to compare the accuracy of the Neural Network, K-Nearest Neighbor, Naïve Bayes, Support Vector Machine, and Decision Tree methods in extracting attributes from documents or texts containing comments, in order to identify the sentiment expressed in them and classify the comments as positive or negative. The study uses real-time data from tweets on Twitter. The data are then processed by first cleaning them of noise using Python. Testing with a confusion matrix yielded an accuracy of 83% for Neural Network, 52% for K-Nearest Neighbor, 83% for Support Vector Machine, and 81% for Decision Tree. The study shows that the Support Vector Machine and Neural Network methods are best for classifying positive and negative comments about franchise businesses.
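As an illustration of how accuracy is read off a confusion matrix for a positive/negative sentiment task, here is a minimal sketch. The label strings and the toy predictions are hypothetical and are not taken from the study's data.

```python
def confusion_counts(y_true, y_pred, positive="positive"):
    """Return (tp, fp, fn, tn) for a two-class problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of all predictions that fall on the matrix diagonal."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical labels for five tweets.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

counts = confusion_counts(y_true, y_pred)
print("tp, fp, fn, tn =", counts)
print("accuracy =", accuracy(*counts))
```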

Klasifikasi Jenis Pemeliharaan dan Perawatan Container Crane menggunakan Algoritma Machine Learning

MATICS ◽

10.18860/mat.v13i1.11525 ◽

2021 ◽

Vol 13 (1) ◽

pp. 21-27

Author(s):

Via Ardianto Nugroho ◽

Derry Pramono Adi ◽

Achmad Teguh Wibowo ◽

MY Teguh Sulistyono ◽

Agustinus Bimo Gumelar

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Logistic Regression ◽

Random Forest ◽

Decision Tree ◽

Nearest Neighbor ◽

Support Vector ◽

K Nearest Neighbor ◽

Container Crane ◽

Model Tree

In the container-handling services industry, Terminal Nilam is a customer of PT. BIMA, a company that specializes in heavy-equipment repair and maintenance services. The terminal is a hub for loading and unloading domestic containers and has four container cranes serving two ships. Maintenance of heavy equipment such as the container cranes in operation has so far paid little attention to data on the grouping or classification of the type of maintenance the equipment requires. Over time, the equipment may perform below its best, which can even lead to workplace accidents. In addition, neglected container crane maintenance can inflate subsequent maintenance costs. Loading and unloading production targets may fall short, and delays to ship berthing schedules become very likely. According to the authors, Machine Learning (ML) can readily eliminate these risks. In this study, ML is designed to identify and then group the appropriate type of container crane maintenance, namely light or heavy. The ML methods chosen are Random Forest, Support Vector Machine, k-Nearest Neighbor, Naïve Bayes, Logistic Regression, J48, and Decision Tree. The study demonstrates the success of tree-based ML models in learning container crane maintenance data (numerical and categorical), with J48 performing best, reaching 99.1% for both accuracy and ROC-AUC. Classification is based on the date of last maintenance, hour meter, breakdown, shutdown, and spare parts.

Perbandingan Algoritma Klasifikasi untuk Prediksi Cacat Software dengan Pendekatan CRISP-DM

Jurnal Sains dan Informatika ◽

10.34128/jsi.v7i2.313 ◽

2021 ◽

Vol 7 (2) ◽

pp. 117-126

Author(s):

Nurtriana Hidayati ◽

Joko Suntoro ◽

Galet Guntoro Setiaji

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Software Quality ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

K Nearest Neighbor

The most important part of software quality is software defect prediction. Software defect prediction uses software testing metrics for classification that can estimate the quality of program modules. In general, software testing results are divided into two classes: defect-prone software and non-defect-prone software. Machine learning methods perform better at finding software defects than manual methods. Machine learning classification algorithms that have been used for software defect prediction include k-Nearest Neighbor (k-NN), Naïve Bayes (NB), and Decision Tree (CART). This study compares the performance of the k-NN, NB, and CART classification algorithms for software defect prediction. The software metrics used are seven datasets from NASA MDP. The results show that the mean accuracy of CART, 0.867, is better than that of k-NN and NB, whose mean accuracies are 0.859 and 0.778, respectively.

Real Time Smartphone Data for Prediction of Nomophobia Severity using Supervised Machine Learning

10.21467/proceedings.114.11 ◽

2021 ◽

Author(s):

Anshika Arora ◽

Pinaki Chakraborty ◽

M.P.S. Bhatia

Keyword(s):

Machine Learning ◽

Real Time ◽

Undergraduate Students ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Supervised Machine Learning ◽

Support Vector ◽

K Nearest Neighbor ◽

F Measure

Excessive smartphone use throughout the day, with dependence on the device for social interaction, entertainment, and information retrieval, may lead users to develop nomophobia, which makes them feel anxious when their smartphone is unavailable. This study describes the usefulness of real-time smartphone usage data for predicting nomophobia severity with machine learning. Data were collected from 141 undergraduate students, covering their perception of their smartphone use, measured with the Nomophobia Questionnaire (NMP-Q), and their real-time smartphone usage patterns, recorded with a purpose-built Android application. Supervised machine learning models, including Random Forest, Decision Tree, Support Vector Machines, Naïve Bayes, and K-Nearest Neighbor, were trained on two feature sets: the first comprises only the NMP-Q features, and the second comprises real-time smartphone usage features along with the NMP-Q features. Model performance was evaluated using F-measure and area under the ROC curve. All models performed better when given the smartphone usage features along with the NMP-Q features. Naïve Bayes outperformed the other models in predicting nomophobia, achieving an F-measure of 0.891 and an ROC area of 0.933.
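The two evaluation measures named in this abstract can be computed as in the sketch below. It assumes binary labels (1 for the positive class, 0 for the negative class) and real-valued classifier scores. The function names and toy values are illustrative only and do not reproduce the study's data. The ROC area is computed through its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print("ROC area:", roc_auc(y_true, scores))
print("F-measure:", f_measure(0.5, 0.5))
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a sample of this size. A sort-based rank computation would be preferable for large datasets.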

Integrating synthetic minority oversampling and gradient boosting decision tree for bogie fault diagnosis in rail vehicles

Proceedings of the Institution of Mechanical Engineers Part F Journal of Rail and Rapid Transit ◽

10.1177/0954409718795089 ◽

2018 ◽

Vol 233 (3) ◽

pp. 312-325 ◽

Cited By ~ 11

Author(s):

Linlin Kou ◽

Yong Qin ◽

Xunjun Zhao ◽

Yong Fu

Keyword(s):

Support Vector Machine ◽

Fault Diagnosis ◽

Fault Detection ◽

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbor

Bogies are critical components of a rail vehicle and are important for the safe operation of rail transit. In this study, the authors analyzed real vibration data from the bogies of a railway vehicle, obtained from a Chinese subway company under four different operating conditions. The authors selected 15 feature indexes, spanning time-domain, energy, and entropy features, as well as their correlations. The adaptive synthetic sampling approach–gradient boosting decision tree (ADASYN–GBDT) method is proposed for bogie fault diagnosis. ADASYN–GBDT was compared with three commonly used classifiers (K-nearest neighbor, support vector machine, and Gaussian naïve Bayes), each combined with random forest for feature selection, under different test data sizes. A confusion matrix was used to evaluate the classifiers. K-nearest neighbor, support vector machine, and Gaussian naïve Bayes require the optimal features to be selected first, whereas the proposed method does not. These three classifiers also produced inaccurate results in multi-class identification. The lowest false detection rates of the proposed ADASYN–GBDT model are reported as 92.95% and 87.81% when the proportion of the test dataset is 0.4 and 0.9, respectively. In addition, the ADASYN–GBDT model is able to identify a fault correctly, which makes it more practical and suitable for use in railway operations. The entire process (training and testing) finished in 2.4231 s, and the detection procedure took 0.0027 s on average. The results show that the proposed ADASYN–GBDT method satisfies the real-time performance and accuracy requirements of online fault detection and might therefore aid in the fault detection of bogies.
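The idea behind synthetic minority oversampling can be illustrated with a simplified SMOTE-style interpolation. Note that this is not ADASYN itself: ADASYN additionally generates more synthetic samples around minority points that have many majority-class neighbors, a weighting step this sketch omits. The function name, parameters, and toy points are assumptions made for illustration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class feature vectors (corners of the unit square).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_oversample(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority samples, the new data stay inside the region the minority class already occupies.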

Perbandingan Metode Klasifikasi Multiclass untuk Pemetaan Zona Risiko COVID-19 di Pulau Jawa

Jurnal Komputer dan Informatika ◽

10.35508/jicon.v9i1.3602 ◽

2021 ◽

Vol 9 (1) ◽

pp. 98-107

Author(s):

Jesica Nauli Br. Siringo Ringo ◽

Wahyu Joko Mursalin ◽

Nisrina Citra Nurfadilah ◽

Dwiky Rachmat Ramadhan ◽

Wa Ode Zuhayeni Madjida

Keyword(s):

Neural Network ◽

Data Mining ◽

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Imbalanced Data ◽

Naïve Bayes ◽

K Nearest Neighbor ◽

Missing Value

The large increase in COVID-19 cases in Indonesia, particularly on the island of Java, calls for a range of control efforts. One effective measure is prevention through providing information about conditions in a region. As a warning to the public and as a basis for regional policymaking, Indonesia issues risk zones down to the regency/city level through the COVID-19 Handling Task Force (Satgas Penanganan COVID-19). The risk zone levels are determined with a conventional technique, namely weighted scoring using information from three types of indicators. Considering that risk zones are important in setting COVID-19 policy, this study aims to build a risk zone classification model for regencies/cities in Java using several data mining classification techniques and to identify the best model based on evaluation results. The classification techniques compared are naive Bayes, decision tree, k-nearest neighbor, and neural network. Before modeling, the data were prepared in a preprocessing stage, during which missing values and imbalanced data were identified. These problems were handled with data imputation and oversampling. The results show that the k-nearest neighbor model is the best of the four models. This conclusion is based on the evaluation measures of the four models: the k-NN model has the highest accuracy and the highest macro-averaged sensitivity, specificity, and F1 score.
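Macro-averaged sensitivity, specificity, and F1 for a multiclass problem can be computed one class at a time (one-vs-rest) and then averaged with equal weight per class. The sketch below uses hypothetical class labels rather than the study's risk zone data, and the function name `macro_scores` is an assumption.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged sensitivity, specificity and F1 (one-vs-rest)."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    sens, spec, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        tn = sum(1 for t, p in pairs if t != c and p != c)
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        sens.append(sensitivity)
        spec.append(specificity)
        f1s.append(f1)
    n = len(classes)
    # Every class contributes equally, regardless of how many samples it has.
    return {
        "sensitivity": sum(sens) / n,
        "specificity": sum(spec) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical two-class example.
print(macro_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Because each class carries the same weight, macro averages are less dominated by the majority class than plain accuracy, which makes them useful for imbalanced data.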

ANALISA 4 ALGORITMA DALAM KLASIFIKASI LIVER MENGGUNAKAN RAPIDMINER

Jurnal Informatika Polinema ◽

10.33795/jip.v6i2.274 ◽

2020 ◽

Vol 6 (2) ◽

pp. 1-9

Author(s):

Annisa Putri Ayudhitama ◽

Utomo Pujianto

Keyword(s):

Neural Network ◽

Machine Learning ◽

Data Mining ◽

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

World Health ◽

K Nearest Neighbor ◽

Health Organization

The liver is one of the vital organs of the human body. It detoxifies or neutralizes toxins from everything that enters the body, keeping the body healthy. The liver can be affected by disease that impairs this function. Once liver disease sets in, toxins spread throughout the body and make it unhealthy. Liver disease is caused by viruses, alcohol, lifestyle, and other factors. According to WHO (World Health Organization) data, almost 1.2 million people per year, particularly in Southeast Asia and Africa, die from liver disease. People often do not realize, or realize too late, that they have liver disease, so the disease is already severe when it is examined. Earlier treatment based on recognizing the symptoms would be better. Data mining can make diagnosing liver disease easier, in particular by helping doctors determine whether a patient whose symptoms closely resemble liver disease actually has it. Diagnosis is performed as a classification task whose output is whether or not the patient has liver disease. This study uses four data mining algorithms: Naïve Bayes, K-Nearest Neighbor (KNN), Decision Tree, and Neural Network. The dataset is the Indian Liver Patient Dataset (ILPD) from the UCI Machine Learning Repository website. The four algorithms are compared to see which gives the best accuracy for liver disease diagnosis. The results show accuracies of 55.75% for Naïve Bayes, 66.36% for K-Nearest Neighbor, 67.04% for Decision Tree, and 70.50% for Neural Network. These accuracies are relatively low because the classes are imbalanced: liver disease patients outnumber patients without liver disease, so many records are classified as liver disease patients.
Keywords: Data Mining, Decision Tree, Classification, KNN, Liver, Naïve Bayes, Neural Network
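One way to put accuracies on an imbalanced dataset in context is to compare them with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent class. The sketch below uses hypothetical class counts (70 versus 30), not the actual ILPD distribution.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    _, majority_count = Counter(labels).most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical imbalanced label set: 70 patients with liver disease, 30 without.
labels = ["liver"] * 70 + ["no_liver"] * 30
print("baseline accuracy:", majority_baseline_accuracy(labels))
```

A model whose accuracy is close to this baseline may have learned little beyond predicting the majority class, which is why per-class measures such as recall are worth reporting alongside accuracy.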

Multiclass Severity Classification for Software Bugs Using Support Vector Machine, K-Nearest Neighbor, Decision Tree and Naïve Bayes

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9348 ◽

2020 ◽

Vol 17 (11) ◽

pp. 5109-5112

Author(s):

Raj Kumar ◽

Sanjay Singla

Keyword(s):

Decision Tree ◽

Software Development ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

Data Mining Algorithm ◽

K Nearest Neighbor ◽

Software Bugs ◽

The Impact

During software development, almost 30–35 percent of the cost is due to testing. If a bug travels from one phase to succeeding phases without detection, it increases the cost of development and may compromise software quality. Data mining algorithms are therefore valuable for software bug classification. Bug severity may be categorised into S1, S2, S3, S4, and S5, depending on the impact of the bug. In this paper, multiclass classification of bug severity is performed using SVM, KNN, Decision Tree, and Naïve Bayes. The algorithms are compared with respect to accuracy, precision, recall, and execution time.

Classification methods comparison for customer churn prediction in the telecommunication industry

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.12.001 ◽

2021 ◽

Vol 8 (12) ◽

pp. 1-8

Author(s):

Makruf et al.

Keyword(s):

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Telecommunication Service ◽

Support Vector ◽

Telecommunication Industry ◽

Classification Methods ◽

K Nearest Neighbor ◽

Customer Churn

The need for telecommunication services has increased dramatically in schools, offices, entertainment, and other areas. At the same time, competition between telecommunication companies is getting tougher. Customer churn is one of the areas in which a company can gain competitive advantage. This paper compares several classification methods for predicting whether customers will cancel their subscription to a telecommunication service, highlighting key factors of customer churn. The comparison is non-trivial because the telecommunication industry urgently needs to identify the most appropriate techniques for analyzing customer churn, and such a comparison is often of great commercial value. The results show that Artificial Neural Network (ANN) predicts churn with an accuracy of 79%, Support Vector Machine (SVM) with 78%, Gaussian Naïve Bayes and K-Nearest Neighbor (KNN) each with 75%, and Decision Tree with 70%. The technique with the highest F-measure is Gaussian Naïve Bayes at 65%, and the technique with the lowest is Decision Tree at 49%. Hence, ANN and Gaussian Naïve Bayes are the two methods most recommended for predicting customer churn in the telecommunication industry.
