Random Forest Classifier untuk Deteksi Penderita COVID-19 berbasis Citra CT Scan

COVID-19 is a viral disease that spread widely enough to become a pandemic. The virus attacks a vital human organ, the lungs, so this study focuses on identifying COVID-19 in the lungs. The study uses lung CT scan images and aims to detect the presence or absence of the virus by classifying the images into three classes with the Random Forest algorithm, combined with several feature extraction methods: Haralick, Color Histogram, and Hu Moments. The experiments start with a single feature, then combine it with the other features, and then compare the results against other classification algorithms, namely K-Nearest Neighbor (KNN), Decision Tree, Linear Discriminant Analysis (LDA), Logistic Regression, Support Vector Machine (SVM), and Naive Bayes. The results show that the highest accuracy, 96.9%, was obtained by Random Forest with the Haralick and Color Histogram features, followed by KNN at 96.5% and Decision Tree at 95.5%; the lowest was Naive Bayes at 42.4%.

COMPARATIVE STUDY OF CLASSIFICATION ALGORITHMS: HOLDOUTS AS ACCURACY ESTIMATION

CogITo Smart Journal ◽

10.31154/cogito.v1i1.2.13-23 ◽

2016 ◽

Vol 1 (1) ◽

pp. 13 ◽

Cited By ~ 1

Author(s):

Debby Erce Sondakh

Keyword(s):

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Decision Rules ◽

Naïve Bayes ◽

Support Vector ◽

Classification Algorithms ◽

K Nearest Neighbor ◽

Accuracy Estimation ◽

F Measure

This study measures and compares the performance of five machine-learning text classification algorithms, namely decision rules, decision tree, k-nearest neighbor (k-NN), naïve Bayes, and Support Vector Machine (SVM), on multi-class text documents. The comparison covers the effectiveness of each algorithm, that is, its ability to assign documents to the correct category, using the holdout (percentage split) method. Effectiveness is measured by precision, recall, F-measure, and accuracy. The experiments show that for naïve Bayes, the larger the share of training documents, the higher the accuracy of the resulting model. Naïve Bayes reached its highest accuracy at a 90/10 split, SVM at 80/20, and decision tree at 70/30. The experiments also show that naïve Bayes had the highest effectiveness among the five algorithms tested and the fastest model-building time, 0.02 seconds. Decision tree classified text documents with higher accuracy than SVM, but its model-building time was slower. In terms of model-building time, k-NN was the fastest, but its accuracy was lower.
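The holdout evaluation described in this abstract can be illustrated with a minimal sketch. It is not the paper's code: the function names, the toy labels, and the fixed random seed are assumptions made for illustration, and the classifier itself is left out so that only the split and the effectiveness measures are shown.

```python
import random

def holdout_split(items, train_fraction, seed=0):
    """Shuffle a copy of items and split it into (train, test) parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F-measure for one class, treated as 'positive'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# The three percentage splits named in the abstract, on 100 placeholder documents.
docs = list(range(100))
for fraction in (0.9, 0.8, 0.7):
    train, test = holdout_split(docs, fraction)
    print(fraction, len(train), len(test))

# Hypothetical true and predicted categories for six test documents.
y_true = ["sport", "sport", "politics", "tech", "tech", "tech"]
y_pred = ["sport", "politics", "politics", "tech", "tech", "sport"]
print(per_class_scores(y_true, y_pred, "tech"))
print(accuracy(y_true, y_pred))
```

For multi-class documents the per-class scores would typically be averaged over all categories; how the paper averaged them is not stated in the abstract.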

KOMPARASI METODE KLASIFIKASI PADA ANALISIS SENTIMEN USAHA WARALABA BERDASARKAN DATA TWITTER

Jurnal Pilar Nusa Mandiri ◽

10.33480/pilar.v15i2.752 ◽

2019 ◽

Vol 15 (2) ◽

pp. 267-274

Author(s):

Tati Mardiana ◽

Hafiz Syahreva ◽

Tuslaela Tuslaela

Keyword(s):

Neural Network ◽

Support Vector Machine ◽

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Confusion Matrix ◽

Naïve Bayes ◽

Support Vector ◽

K Nearest Neighbor

Franchise businesses in Indonesia are currently relatively attractive, yet many franchisees fail. Anyone who wants to start such a business needs to consider public sentiment toward franchises. Sentiment analysis is not easy, however, because Twitter conversations about franchise businesses are numerous and unstructured. This study compares the accuracy of Neural Network, K-Nearest Neighbor, Naïve Bayes, Support Vector Machine, and Decision Tree in extracting attributes from documents or text containing comments, in order to identify the sentiment expressed and classify the comments as positive or negative. The study uses real-time data from tweets on Twitter, which is first cleaned of noise using Python. Testing with a confusion matrix gave an accuracy of 83% for Neural Network, 52% for K-Nearest Neighbor, 83% for Support Vector Machine, and 81% for Decision Tree. The study shows that Support Vector Machine and Neural Network are the best methods for classifying positive and negative comments about franchise businesses.
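As a rough illustration of how accuracy is read off a confusion matrix for a positive/negative classifier, the sketch below builds the matrix from true and predicted labels. The labels and helper names are hypothetical and are not taken from the paper's data.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy_from_matrix(matrix):
    """Correct predictions sit on the diagonal of the matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

labels = ["positive", "negative"]
# Hypothetical sentiment labels for six tweets.
y_true = ["positive", "positive", "negative", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "positive", "positive"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)
print(accuracy_from_matrix(matrix))
```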

Comparison of Naive Bayes, Random Forest, Decision Tree, Support Vector Machines, and Logistic Regression Classifiers for Text Reviews Classification

Baltic Journal of Modern Computing ◽

10.22364/bjmc.2017.5.2.05 ◽

2017 ◽

Vol 5 (2) ◽

Cited By ~ 22

Author(s):

Tomas Pranckevičius ◽

Virginijus Marcinkevičius

Keyword(s):

Logistic Regression ◽

Support Vector Machines ◽

Random Forest ◽

Decision Tree ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

Vector Machines

PERBANDINGAN ALGORITMA KLASIFIKASI DALAM PENGKLASIFIKASIAN DATA PENYAKIT JANTUNG KORONER

Jurnal Ilmiah Teknologi dan Rekayasa ◽

10.35760/tr.2019.v24i3.2393 ◽

2019 ◽

Vol 24 (3) ◽

pp. 161-170

Author(s):

Ardea Bagas Wibisono ◽

Achmad Fahrurozi

Keyword(s):

Random Forest ◽

Decision Tree ◽

Cross Validation ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Performance Measure ◽

K Nearest Neighbor

Coronary heart disease (CHD) is the leading cause of death at all ages after stroke. This has prompted much research on the disease, including computer-based methods. Large volumes of data can be processed by classification with a suitable algorithm so that results are fast and accurate. Commonly used classification methods include Naïve Bayes, K-Nearest Neighbor, Decision Tree, and Random Forest. Naïve Bayes uses the probabilities in each data item, K-Nearest Neighbor uses distance calculations, Decision Tree uses a decision tree, and Random Forest uses several decision trees combined. This study compares the four algorithms on classifying coronary heart disease data. The comparison is based on performance measures consisting of accuracy, per-class recall, and per-class precision. Each algorithm is tested using cross-validation. Based on the comparison on a dataset of 300 coronary heart disease records, Random Forest performed better than Naïve Bayes, K-Nearest Neighbor, and Decision Tree. Classification with Random Forest had a mean accuracy of 85.668%, with recall of 89% for class '1' and 83.6% for class '0', and precision of 85% for class '1' and 85.8% for class '0'.
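The cross-validation procedure mentioned here can be sketched as follows. The number of folds, the helper names, and the majority-class placeholder classifier are all assumptions for illustration; the abstract does not state how many folds were used, and a real run would plug in one of the four algorithms and shuffle the records first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_validate(fit_predict, X, y, k):
    """Mean accuracy over k folds for any fit_predict(train_X, train_y, test_X)."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

def majority_class(train_X, train_y, test_X):
    """Placeholder classifier: always predict the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_X)

# Toy data: ten records, seven of class 1 and three of class 0.
X = list(range(10))
y = [1] * 7 + [0] * 3
mean_accuracy = cross_validate(majority_class, X, y, k=5)
print(mean_accuracy)
```

Averaging the per-fold scores is what produces a mean accuracy figure of the kind reported in the abstract.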

Penanganan Ketidakseimbangan Data pada Prediksi Customer Churn Menggunakan Kombinasi SMOTE dan Boosting

IJCIT (Indonesian Journal on Computer and Information Technology) ◽

10.31294/ijcit.v6i1.9545 ◽

2021 ◽

Vol 6 (1) ◽

Author(s):

Nana Suryana ◽

Pratiwi Pratiwi ◽

Rizki Tri Prasetio

Keyword(s):

Data Mining ◽

Deep Learning ◽

Random Forest ◽

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

K Nearest Neighbor ◽

Customer Churn ◽

Number Of Customers

The telecommunications industry faces stiff competition between service providers. This competition results in customer churn, the movement of customers from one service to another. Customer churn is a major problem because it can affect a company's revenue, profitability, survival, and service quality. Identifying early which customers will churn is therefore an effective measure, because it helps companies plan how to retain their customers. Customers who have withdrawn from a service usually make up only a small share of a company's data, and this scarcity makes churn difficult to predict, which is a challenging issue in machine learning. The general purpose of this research is to predict which customers will move to another service or withdraw from their current one. Its specific purpose is to handle data imbalance in churn prediction using data-level optimization through a sampling method, the Synthetic Minority Over-sampling Technique (SMOTE), combined with algorithm-level optimization through Boosting. Several prediction algorithms, namely random forest, naïve Bayes, decision tree, k-nearest neighbor, and deep learning, are implemented to find the best algorithm after optimization with SMOTE and Boosting. The research follows CRISP-DM, a cross-industry framework for data mining research. The results indicate that random forest yields the best accuracy after optimization with SMOTE and Boosting, at 89.19%.
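The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used in the paper: the function name, the value of k, the seed, and the two-feature toy data are all hypothetical.

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class tuples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical churner records with two numeric features each.
churners = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_points = smote_sketch(churners, n_new=6)
for point in new_points:
    print(point)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies, which is why the method is usually preferred over simply duplicating records.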

Performance Analysis of Supervised Machine Learning Algorithms on Medical Dataset

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f7908.038620 ◽

2020 ◽

Vol 8 (6) ◽

pp. 1637-1642

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Decision Tree ◽

Naive Bayes ◽

Naïve Bayes ◽

Learning System ◽

Supervised Machine Learning ◽

Support Vector ◽

Heart Problem

Machine learning (ML) algorithms are designed to make predictions based on features. With machine learning, a system can automatically learn and improve from experience. Machine learning is a branch of artificial intelligence and is broadly categorized into two types: supervised and unsupervised. Supervised ML performs classification, while unsupervised ML is used for clustering. Machine learning is now used in many areas, including biometric recognition, handwriting recognition, and medical diagnosis. In the medical field, it plays an important role in identifying diseases based on a patient's features. Doctors currently use software based on machine learning algorithms to diagnose diseases such as cancer and cardiac arrest. In this paper we used an ensemble learning method to predict heart problems. The study describes the performance of ML algorithms by comparing evaluation measures such as F-measure, recall, ROC, precision, and accuracy. It was carried out with various combinations of ML classifiers, namely Decision Tree (DT), Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF). The results showed that combining two ML algorithms, DT with NB, achieved 81.1% accuracy. The SVM, Decision Tree, Naïve Bayes, and Random Forest models were also trained and tested individually.

Integrating synthetic minority oversampling and gradient boosting decision tree for bogie fault diagnosis in rail vehicles

Proceedings of the Institution of Mechanical Engineers Part F Journal of Rail and Rapid Transit ◽

10.1177/0954409718795089 ◽

2018 ◽

Vol 233 (3) ◽

pp. 312-325 ◽

Cited By ~ 11

Author(s):

Linlin Kou ◽

Yong Qin ◽

Xunjun Zhao ◽

Yong Fu

Keyword(s):

Support Vector Machine ◽

Fault Diagnosis ◽

Fault Detection ◽

Decision Tree ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbor

Bogies are critical components of a rail vehicle and are important for the safe operation of rail transit. In this study, the authors analyzed real vibration data from the bogies of a railway vehicle, obtained from a Chinese subway company under four different operating conditions. The authors selected 15 feature indexes, covering time-domain, energy, and entropy features, as well as their correlations. The adaptive synthetic sampling approach–gradient boosting decision tree (ADASYN–GBDT) method is proposed for bogie fault diagnosis. ADASYN–GBDT was compared with three commonly used classifiers (K-nearest neighbor, support vector machine, and Gaussian naïve Bayes), each combined with random forest for feature selection, under different test data sizes. A confusion matrix was used to evaluate the classifiers. K-nearest neighbor, support vector machine, and Gaussian naïve Bayes require the optimal features to be selected first, whereas the proposed method does not. These three classifiers also produced inaccurate results in multi-class identification. The lowest false detection rates of the proposed ADASYN–GBDT model are 92.95% and 87.81% when the proportion of the test dataset is 0.4 and 0.9, respectively. In addition, the ADASYN–GBDT model can correctly identify a fault, which makes it more practical and suitable for use in railway operations. The entire process (training and testing) finished in 2.4231 s, and the detection procedure took 0.0027 s on average. The results show that the proposed ADASYN–GBDT method satisfies the requirements of real-time performance and accuracy for online fault detection and might therefore aid in the fault detection of bogies.

Classification Breast Cancer Revisited with Machine Learning

International Journal on Data Science ◽

10.18517/ijods.1.1.42-50.2020 ◽

2020 ◽

Vol 1 (1) ◽

pp. 42-50

Author(s):

Hanna Arini Parhusip ◽

Bambang Susanto ◽

Lilik Linawati ◽

Suryasatriya Trihandaru ◽

Yohanes Sardjono ◽

...

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Random Forest ◽

Naive Bayes ◽

Naïve Bayes ◽

Machine Learning Algorithms ◽

Support Vector ◽

Random Forest Algorithm ◽

K Nearest Neighbor ◽

Cancer Data

The article studies several machine learning algorithms applied to breast cancer data with 33 features from 569 samples. The purpose of this research is to find the best algorithm for classifying breast cancer. Because the features can have very different scales and ranges, the data are transformed before classification. The classification methods used are logistic regression, k-nearest neighbor, naive Bayes, support vector machine, decision tree, and random forest. Both the original and the transformed data are classified with a test size of 0.3. The SVM and naive Bayes algorithms show no improvement in accuracy, and random forest gives the best accuracy of all. The test size is then reduced to 0.25, which improves all algorithms on the transformed data. Random forest still gives the best accuracy.
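The abstract does not say which transformation was applied to the features. One common choice for features on very different scales is standardization (zero mean, unit variance per column), sketched below as an assumption rather than as the authors' method; the function name and the toy rows are hypothetical.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Rescale each column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # Guard against constant columns, whose standard deviation is zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in rows]

# Two hypothetical features whose ranges differ by four orders of magnitude.
rows = [[10.0, 0.001], [20.0, 0.002], [30.0, 0.003]]
scaled = standardize_columns(rows)
for row in scaled:
    print(row)
```

Distance-based methods such as k-nearest neighbor are the ones most affected by unscaled features, since a feature with a large range dominates the distance; tree-based methods such as random forest are largely insensitive to this kind of rescaling.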

A Dataset Centric Feature Selection and Stacked Model to Detect Breast Cancer

International Journal of Intelligent Systems and Applications ◽

10.5815/ijisa.2021.04.03 ◽

2021 ◽

Vol 13 (4) ◽

pp. 24-37

Author(s):

Avijit Kumar Chaudhuri ◽

Dilip K. Banerjee ◽

Anirban Das

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Decision Tree ◽

Naive Bayes ◽

Naïve Bayes ◽

World Health ◽

Support Vector ◽

Kappa Statistics

The World Health Organisation declared breast cancer (BC) the most frequent affliction among women, accounting for 15 percent of all cancer deaths. Accurate prediction is of utmost significance, as it not only prevents deaths but also avoids mistreatment. The conventional way of diagnosis includes estimating tumor size as a sign of plausible cancer. Machine learning (ML) techniques have proven effective at predicting disease. However, ML methods have been method centric rather than dataset centric. In this paper, the authors introduce a dataset centric approach (DCA) that deploys a genetic algorithm (GA) to identify the features and an ensemble learning classifier to predict using the right features. AdaBoost is such an approach: it trains the model by assigning weights to individual records rather than experimenting on the splitting of datasets alone, and performs hyper-parameter optimization. The authors simulate the results by varying the base classifiers, i.e., logistic regression (LR), decision tree (DT), support vector machine (SVM), naive Bayes (NB), and random forest (RF), and by using 10-fold cross-validation with different splits of the dataset into training and testing. The proposed DCA model with RF and 10-fold cross-validation demonstrated its potential with almost 100% performance in the classification results, which no research had reported so far. The DCA satisfies the underlying principles of data mining: the principle of parsimony, the principle of inclusion, the principle of discrimination, and the principle of optimality. The DCA is a democratic and unbiased ensemble approach: it allows all features and methods to compete at the start, then filters out the most reliable chain (of steps and combinations) that gives the highest accuracy. With fewer characteristics and splits of 50-50, 66-34, and 10-fold cross-validation, the Stacked model achieves 97% accuracy. These values and the reduction of features improve upon prior research works. Further, the proposed classifier is compared with some state-of-the-art machine-learning classifiers, namely random forest, naive Bayes, support-vector machine with radial basis function kernel, and decision tree. To test the classifiers, different performance metrics were employed (accuracy, detection rate, sensitivity, specificity, receiver operating characteristic, and area under the curve), along with statistical tests such as the Wilcoxon signed-rank test and kappa statistics, to check the strength of the proposed DCA classifier. Various splits of training and testing data, namely 50–50%, 66–34%, 80–20%, and 10-fold cross-validation, were used to test the credibility of the classification models in handling unbalanced data. Finally, the proposed DCA model demonstrated its potential with almost 100% performance in the classification results. The results were also compared with other research on the same dataset, where the proposed classifiers were found to be best across all the performance dimensions.

Multiclass Severity Classification for Software Bugs Using Support Vector Machine, K-Nearest Neighbor, Decision Tree and Naïve Bayes

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9348 ◽

2020 ◽

Vol 17 (11) ◽

pp. 5109-5112

Author(s):

Raj Kumar ◽

Sanjay Singla

Keyword(s):

Decision Tree ◽

Software Development ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

Data Mining Algorithm ◽

K Nearest Neighbor ◽

Software Bugs ◽

The Impact

During software development, almost 30–35 percent of the cost is due to testing. If a bug travels from one phase to the succeeding phases without detection, it increases the cost of development and may compromise software quality. Using data mining algorithms for software bug classification is therefore highly desirable. Bug severity may be categorised into S1, S2, S3, S4, and S5, depending on the impact of the bug. In this paper, multiclass classification of bug severity is done using SVM, KNN, Decision Tree, and Naïve Bayes. The algorithms are compared with respect to accuracy, precision, recall, and execution time.
