PREDICTION OF CILIWUNG RIVER WATER QUALITY USING A DECISION TREE ALGORITHM

2021 ◽  
Vol 12 (2) ◽  
Author(s):  
Mohammad Haekal ◽  
Henki Bayu Seta ◽  
Mayanda Mega Santoni

To predict the water quality of the Ciliwung River, online monitoring data were processed using data mining methods. The monitoring records were first tabulated in Microsoft Excel and then processed into a decision tree with the Decision Tree algorithm in the WEKA application. The Decision Tree method was chosen because it is simple, easy to understand, and highly accurate. A total of 5,476 water-quality monitoring records for the Ciliwung River were processed. Classification with the Decision Tree showed that 1,059 of the 5,476 records (19.3242%) indicated the Ciliwung River was Not Polluted, while 4,417 records (80.6758%) indicated it was Polluted. The monitoring data were then evaluated using four test options: Use Training Set, Supplied Test Set, 10-fold Cross-Validation, and 66% Percentage Split. All four test options showed very high accuracy, above 99%. From these results it can be predicted that the Ciliwung River is indicated as a polluted river with reference to Government Regulation of the Republic of Indonesia No. 82 of 2001, and that using the WEKA application with the Decision Tree algorithm to process the monitoring data based on three parameters (pH, DO, and nitrate) is highly accurate and appropriate. Keywords: river water quality, data mining, Decision Tree algorithm, WEKA application.
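
For illustration only, a minimal scikit-learn sketch of the same workflow is given below (the study itself used WEKA's decision tree); the tiny synthetic table and the labeling rule are assumptions standing in for the 5,476 Excel records, not data from the paper.

# Sketch: decision-tree classification of river water quality from the three
# parameters named above, with two of the study's four evaluation options
# (10-fold cross-validation and a 66% percentage split). The synthetic table
# stands in for the Excel export; in practice it would be loaded with pandas.read_excel.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "pH": rng.uniform(5.0, 8.5, 600),
    "DO": rng.uniform(1.0, 8.0, 600),        # dissolved oxygen, mg/L
    "Nitrate": rng.uniform(0.0, 25.0, 600),  # mg/L
})
# Toy label: "Polluted" when dissolved oxygen is low (illustrative rule only)
data["Status"] = np.where(data["DO"] < 4.0, "Polluted", "Not Polluted")

X, y = data[["pH", "DO", "Nitrate"]], data["Status"]
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)

print("10-fold CV accuracy:", cross_val_score(tree, X, y, cv=10).mean())
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66, random_state=0)
print("66% split accuracy:", tree.fit(X_tr, y_tr).score(X_te, y_te))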

2013 ◽  
Vol 655-657 ◽  
pp. 963-968
Author(s):  
Yan Feng Zhang ◽  
Ting Ting Li

C4.5, Bayesian network, and Sequential Minimal Optimization (SMO) are three typical classification algorithms in data mining. Using 10-fold cross-validation, we obtain analysis and calculation results for the three classification algorithms on the same training and test sets. The main metrics include accuracy, precision, speed, robustness, scalability, and comprehensibility, which we illustrate with margin curves. The comparison provides a theoretical and experimental basis for users to select a proper classification algorithm for training sets that differ in quality and size.
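
As a rough illustration of the comparison protocol, the sketch below runs 10-fold cross-validation over scikit-learn stand-ins for the three algorithms (a decision tree for C4.5, Gaussian naive Bayes in place of the Bayesian classifier, and a linear SVC for an SMO-trained SVM); the built-in dataset is a placeholder, not the paper's data.

# Sketch: the same 10-fold cross-validation applied to three classifiers,
# reporting accuracy and precision for each.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
models = {"C4.5-like tree": DecisionTreeClassifier(criterion="entropy"),
          "Bayes": GaussianNB(),
          "SMO-like SVM": SVC(kernel="linear")}

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=10, scoring=["accuracy", "precision"])
    print(name,
          "accuracy=%.3f" % scores["test_accuracy"].mean(),
          "precision=%.3f" % scores["test_precision"].mean())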


2020 ◽  
Vol 7 (2) ◽  
pp. 200
Author(s):  
Puji Santoso ◽  
Rudy Setiawan

One of the tasks in finance marketing is to analyze customer data to find out which customers are likely to take out credit again. The current approach classifies every customer who has completed their credit installments as a marketing target, which leads to high operational marketing costs. This research was therefore conducted to help solve that problem by designing a data mining application that predicts the criteria of credit customers with the potential to borrow (take credit) again from Mega Auto Finance. The Mega Auto Finance funds section located in Kotim Regency was chosen as the case study, on the assumption that it experiences the same problems as described above. The data mining technique applied in the application is classification, and the classification method used is the decision tree, with the C4.5 algorithm used to build the tree. The data processed in this study are the July 2018 installment records of Mega Auto Finance credit customers in Microsoft Excel format. The result of this study is an application that helps the Mega Auto Finance funds section obtain credit marketing targets in the future.
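
A minimal sketch of the core step follows; the column names and this tiny inline table are hypothetical placeholders (the abstract does not list the attributes), and in practice the July 2018 records would be read from the Excel file, e.g. with pandas.read_excel.

# Sketch: a C4.5-style (entropy) decision tree over customer installment records
# to flag customers likely to take credit again, printed as readable rules.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "occupation":        ["farmer", "trader", "civil_servant", "trader", "farmer", "civil_servant"],
    "payment_history":   ["on_time", "late", "on_time", "on_time", "late", "on_time"],
    "tenor_months":      [12, 24, 12, 36, 24, 12],
    "took_credit_again": [1, 0, 1, 1, 0, 1],       # target label from past records
})

X = pd.get_dummies(df[["occupation", "payment_history", "tenor_months"]])  # categorical -> dummies
tree = DecisionTreeClassifier(criterion="entropy").fit(X, df["took_credit_again"])
print(export_text(tree, feature_names=list(X.columns)))   # rules the marketing team can read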


2018 ◽  
Vol 6 (1) ◽  
pp. 1
Author(s):  
Qomariyatul Hasanah ◽  
Anang Andrianto ◽  
Muhammad Arief Hidayat

The posyandu information system for pregnant women manages maternal health data related to pregnancy risk factors. Based on the Poedji Rochyati Score Card (KSPR), midwives determine pregnancy risk by assigning a score to each parameter. A weakness of the KSPR is that it cannot score parameters whose values are not yet known; if a parameter is unknown, it is assumed not to occur. Reading data patterns, a concept adopted from data mining, using the naive Bayes classification method can be an alternative to this weakness of the KSPR by classifying pregnancy risk. The naive Bayes method computes the probability of a given parameter based on data from previous periods designated as training data, and from this calculation the pregnancy risk can be determined precisely from the parameters that are already known. Naive Bayes was chosen because it has relatively high accuracy compared with other classification methods. The information system was built as a website so that it can be accessed easily by several posyandu in different locations, and it was developed following the Waterfall model. The system was designed and built with three access levels, admin, midwife, and cadre, each with features that make it easy for its users. The result of this research is a posyandu information system for pregnant women with pregnancy risk classification using the naive Bayes method, with accuracies of 53.913% using 17 attributes, 54.348% using 19 attributes, 54.783% using 21 attributes, and 56.957% using 22 attributes. The classification accuracy was obtained using ten-fold cross-validation, in which the training data are divided into 10 groups; if group 1 is used as the test set, groups 2 through 10 form the training set. Keywords: Posyandu, Pregnancy Risk, Waterfall, Data Mining, Classification, Naive Bayes
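
A minimal sketch of the attribute-subset comparison is shown below, using synthetic categorical data in place of the KSPR attributes (which are not listed in the abstract); it only illustrates the ten-fold evaluation mechanics, not the reported accuracies.

# Sketch: naive Bayes accuracy under ten-fold cross-validation for different
# attribute subsets, mirroring the 17/19/21/22-attribute comparison above.
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(230, 22))   # 230 records, 22 categorical KSPR-style attributes (synthetic)
y = rng.integers(0, 3, size=230)         # risk class: low / high / very high (synthetic)

for n_attr in (17, 19, 21, 22):
    # min_categories keeps the category range fixed across folds
    acc = cross_val_score(CategoricalNB(min_categories=3), X[:, :n_attr], y, cv=10).mean()
    print(f"{n_attr} attributes: accuracy = {acc:.3f}")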


2016 ◽  
Vol 7 (4) ◽  
Author(s):  
Mochammad Yusa ◽  
Ema Utami ◽  
Emha T. Luthfi

Abstract. Readmission is associated with quality measures for patients in hospitals. The many attributes related to diabetic patients, such as medication, ethnicity, race, lifestyle, age, and others, make the calculation of quality of care complicated. Classification techniques from data mining can solve this problem. In this paper, three classifiers, Decision Tree, k-Nearest Neighbor (k-NN), and Naive Bayes, are evaluated with various parameter settings using 10-fold cross-validation. Performance is assessed in terms of accuracy, Mean Absolute Error (MAE), and the Kappa statistic. The selected dataset consists of 47 attributes and 49,735 records. The results show that the k-NN classifier with k=100 performs better in terms of accuracy and the Kappa statistic, while Naive Bayes outperforms the other classifiers in terms of MAE. Keywords: k-NN, naive bayes, diabetes, readmission. Abstrak (translated from Indonesian). The readmission process is linked to measuring the quality of patient care in hospitals. The varied attributes associated with diabetic patients, such as medication, ethnicity, race, lifestyle, age, and others, make this quality calculation complicated. Data mining classification techniques can be a solution to this calculation; classification is one of the data mining techniques that has developed quite significantly. In this research, Decision Tree, k-Nearest Neighbor (k-NN), and Naive Bayes classification models with various parameter settings are evaluated on accuracy, Mean Absolute Error (MAE), and the Kappa statistic using 10-fold cross-validation. The evaluated dataset has 47 attributes and 49,735 records. The results show that the best accuracy, MAE, and Kappa statistic were obtained with the Naive Bayes model. Kata Kunci (Keywords): k-NN, naive bayes, diabetes, readmission
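
A sketch of the evaluation protocol described above is given below, with synthetic data standing in for the 47-attribute readmission table; the numbers it prints are therefore not the paper's results.

# Sketch: 10-fold cross-validation of Decision Tree, k-NN (k=100), and Naive Bayes,
# scored by accuracy, MAE on the 0/1 readmission label, and Cohen's kappa.
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, mean_absolute_error, cohen_kappa_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 47))          # synthetic stand-in for the 49,735 x 47 records
y = rng.integers(0, 2, size=2000)        # readmitted yes/no

scoring = {"acc": "accuracy",
           "mae": make_scorer(mean_absolute_error, greater_is_better=False),
           "kappa": make_scorer(cohen_kappa_score)}
models = {"DecisionTree": DecisionTreeClassifier(),
          "kNN(k=100)": KNeighborsClassifier(n_neighbors=100),
          "NaiveBayes": GaussianNB()}

for name, model in models.items():
    s = cross_validate(model, X, y, cv=10, scoring=scoring)
    print(name, "acc=%.3f" % s["test_acc"].mean(),
          "MAE=%.3f" % -s["test_mae"].mean(),       # negate: the scorer reports negative error
          "kappa=%.3f" % s["test_kappa"].mean())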


2022 ◽  
Vol 2 (1) ◽  
pp. 1-24
Author(s):  
Rofiana Simanullang ◽  
Dedy Hartama ◽  
Poningsih Poningsih ◽  
Iin Parlina ◽  
Muhammad R. Lubis

Student grade data are important both for the department and for the school, because they are needed to see how the grades of students at SMK GKPS 1 Raya develop over time. The grade data keep growing year after year, and these data can yield useful information if processed well. In this research, the authors therefore use 202 student grade records obtained from SMK GKPS 1 Raya, processed with data mining, to obtain classification information on grade development and to determine student achievement. The algorithm used is the C4.5 decision tree, supported by the RapidMiner software. The attributes used are NISN (student ID number), student name, average grade, and attendance grade, which are entered into Microsoft Excel 2007 and then transformed from numeric scores to letter grades: a score above 90 is an A, 80–89 is a B, 70–79 is a C, and below 60 is a D. This method can serve as one of the tools to help the school monitor the development of student grades, so that the results can distinguish high-achieving from non-achieving students and provide recommendations for the school to further improve its learning system in the future.
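
A small sketch of the stated score-to-letter transformation follows (the 60–69 range and a score of exactly 90 are not covered by the mapping quoted above, so they are left unassigned here rather than guessed).

# Sketch: the numeric-to-letter grade transformation applied before the C4.5 tree
# is built in RapidMiner. Ranges not specified in the abstract return None.
def to_letter(score: float):
    if score > 90:
        return "A"
    if 80 <= score <= 89:
        return "B"
    if 70 <= score <= 79:
        return "C"
    if score < 60:
        return "D"
    return None   # range not specified in the source

print([to_letter(s) for s in (95, 85, 72, 55, 65)])   # ['A', 'B', 'C', 'D', None]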


2021 ◽  
Vol 12 (1) ◽  
pp. 228-242
Author(s):  
Borislava Vrigazova

Abstract Background: The bootstrap can be an alternative to cross-validation as a training/test set splitting method since it minimizes the computing time in classification problems in comparison to tenfold cross-validation. Objectives: This research investigates what proportion should be used to split the dataset into the training and the testing set so that the bootstrap might be competitive in terms of accuracy with other resampling methods. Methods/Approach: Different train/test split proportions are used with the following resampling methods: the bootstrap, the leave-one-out cross-validation, the tenfold cross-validation, and the random repeated train/test split to test their performance on several classification methods. The classification methods used include the logistic regression, the decision tree, and the k-nearest neighbours. Results: The findings suggest that using a different structure of the test set (e.g. 30/70, 20/80) can further optimize the performance of the bootstrap when applied to the logistic regression and the decision tree. For the k-nearest neighbour, the tenfold cross-validation with a 70/30 train/test splitting ratio is recommended. Conclusions: Depending on the characteristics and the preliminary transformations of the variables, the bootstrap can improve the accuracy of the classification problem.
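
A minimal sketch of the resampling comparison is shown below, assuming an out-of-sample bootstrap split with a configurable training share; the built-in dataset and the logistic-regression pipeline are placeholders, not the paper's exact setup.

# Sketch: bootstrap train/test splits (sampling with replacement, out-of-sample rows
# as the test set) compared with 10-fold cross-validation for logistic regression.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
rng = np.random.default_rng(0)

def bootstrap_accuracy(train_share, n_repeats=30):
    scores = []
    for _ in range(n_repeats):
        n_train = int(train_share * len(y))
        train_idx = rng.choice(len(y), size=n_train, replace=True)   # bootstrap sample
        test_idx = np.setdiff1d(np.arange(len(y)), train_idx)        # rows never drawn form the test set
        scores.append(model.fit(X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx]))
    return np.mean(scores)

for share in (0.8, 0.7, 0.5):                 # e.g. 80/20, 70/30, 50/50 train/test structures
    pct = round(share * 100)
    print(f"bootstrap {pct}/{100 - pct} split: {bootstrap_accuracy(share):.3f}")
print("10-fold CV:", cross_val_score(model, X, y, cv=10).mean().round(3))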


2017 ◽  
Author(s):  
Shayan Tabe-Bordbar ◽  
Amin Emad ◽  
Sihai Dave Zhao ◽  
Saurabh Sinha

Abstract. Cross-validation (CV) is a technique to assess the generalizability of a model to unseen data. This technique relies on assumptions that may not be satisfied when studying genomics datasets. For example, random CV (RCV) assumes that a randomly selected set of samples, the test set, well represents unseen data. This assumption does not hold true where samples are obtained from different experimental conditions, and the goal is to learn regulatory relationships among the genes that generalize beyond the observed conditions. In this study, we investigated how the CV procedure affects the assessment of methods used to learn gene regulatory networks. We compared the performance of a regression-based method for gene expression prediction, estimated using RCV, with that estimated using a clustering-based CV (CCV) procedure. Our analysis illustrates that RCV can produce over-optimistic estimates of generalizability of the model compared to CCV. Next, we defined the ‘distinctness’ of a test set from a training set and showed that this measure is predictive of the performance of the regression method. Finally, we introduced a simulated annealing method to construct partitions with gradually increasing distinctness and showed that performance of different gene expression prediction methods can be better evaluated using this method.
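
A minimal sketch of the mechanics of clustering-based CV versus random CV is given below, on synthetic data; it illustrates the procedure only (holding out whole clusters so the test set is distinct from the training set) and will not reproduce the paper's findings.

# Sketch: random CV (shuffled KFold) versus a clustering-based CV, where samples
# are clustered first and whole clusters are held out together.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, KFold, GroupKFold
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=50, noise=10.0, random_state=0)
model = Ridge()

rcv = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="r2")

groups = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(X)   # cluster samples by similarity
ccv = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5), groups=groups, scoring="r2")

print("random CV R^2:           %.3f" % rcv.mean())
print("clustering-based CV R^2: %.3f" % ccv.mean())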


2005 ◽  
Vol 10 (7) ◽  
pp. 653-657 ◽  
Author(s):  
Nadine H. Elowe ◽  
Jan E. Blanchard ◽  
Jonathan D. Cechetto ◽  
Eric D. Brown

High-throughput screening (HTS) generates an abundance of data that are a valuable resource to be mined. Dockers and data miners can use “real-world” HTS data to test and further develop their tools. A screen of 50,000 diverse small molecules was carried out against Escherichia coli dihydrofolate reductase (DHFR) and compared with a previous screen of 50,000 compounds against the same target. Identical assays and conditions were maintained for both studies. Prior to the completion of the second screen, the original screening data were publicly released for use as a “training set,” and computational chemists and data analysts were challenged to predict the activity of compounds in this second “test set.” Upon completion, the primary screen of the test set generated no potent inhibitors of DHFR activity.


Author(s):  
AHMET ALPTEKIN ◽  
OLCAY KURSUN

Leave-one-out (LOO) and its generalization, K-Fold, are among the most well-known cross-validation methods, which divide the sample into many folds, each of which is, in turn, left out for testing while the other parts are used for training. In this study, as an extension of this idea, we propose a new cross-validation approach that we call miss-one-out (MOO), which mislabels the example(s) in each fold and keeps this fold in the training set, rather than leaving it out as LOO does. MOO then tests whether the trained classifier can correct the erroneous labels of the training samples. In principle, having only one fold deliberately labeled incorrectly should have only a small effect on the classifier that uses this bad fold along with the K - 1 good folds, and the result can be utilized as a generalization measure of the classifier. Experimental results on a number of benchmark datasets and three real bioinformatics datasets show that MOO can better estimate the test set accuracy of the classifier.
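
A minimal sketch of the MOO idea as described above is shown below, assuming binary labels so that "mislabeling" is a simple 0/1 flip; the dataset and classifier are placeholders.

# Sketch: miss-one-out (MOO). Each fold is kept in the training set but with
# deliberately wrong labels; the score is the fraction of those mislabeled samples
# whose original labels the trained classifier recovers.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer

def miss_one_out_score(model, X, y, n_splits=10):
    recovered = []
    for _, fold_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        y_corrupted = y.copy()
        y_corrupted[fold_idx] = 1 - y_corrupted[fold_idx]       # mislabel this fold
        model.fit(X, y_corrupted)                               # the fold stays in the training set
        pred = model.predict(X[fold_idx])
        recovered.append(np.mean(pred == y[fold_idx]))          # did it correct the bad labels?
    return float(np.mean(recovered))

X, y = load_breast_cancer(return_X_y=True)
print("MOO estimate:", round(miss_one_out_score(GaussianNB(), X, y), 3))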


2021 ◽  
Author(s):  
Zhilong Yi ◽  
Siqi Hu ◽  
Xiaofeng Lin ◽  
Qiong Zou ◽  
MinHong Zou ◽  
...  

Abstract. Purpose: 68Ga-PSMA PET/CT has high specificity and sensitivity for the detection of both intraprostatic focal tumor lesions and metastases. However, approximately 10% of primary prostate cancers are invisible on PSMA-PET (they exhibit no or minimal uptake). In this work, we investigated whether machine learning-based radiomics models derived from PSMA-PET images could predict invisible intraprostatic lesions on 68Ga-PSMA-11 PET in patients with primary prostate cancer. Methods: In this retrospective study, patients with or without prostate cancer who underwent 68Ga-PSMA PET/CT and presented negative on PSMA-PET images at either of two institutions were included: institution 1 (2017 to 2020) for the training set and institution 2 (2019 to 2020) for the external test set. Three random forest (RF) models were built using selected features extracted from standard PET images, delayed PET images, and both standard and delayed PET images, and subsequent 10-fold cross-validation was performed. In the test phase, the three RF models and PSA density (PSAD, cut-off value: 0.15 ng/ml/ml) were tested on the external test set. The area under the receiver operating characteristic curve (AUC) was calculated for the models and PSAD, and the AUCs of the radiomics models and PSAD were compared. Results: A total of 64 patients (39 with prostate cancer and 25 with benign prostate disease) were in the training set, and 36 (21 with prostate cancer and 15 with benign prostate disease) were in the test set. The average AUCs of the three RF models from 10-fold cross-validation were 0.87 (95% CI: 0.72, 1.00), 0.86 (95% CI: 0.63, 1.00), and 0.91 (95% CI: 0.69, 1.00), respectively. In the test set, the AUCs of the three trained RF models and PSAD were 0.903 (95% CI: 0.830, 0.975), 0.856 (95% CI: 0.748, 0.964), 0.925 (95% CI: 0.838, 1.00), and 0.662 (95% CI: 0.510, 0.813). The AUCs of the three radiomics models were higher than that of PSAD (0.903, 0.856, and 0.925 vs 0.662; P = .007, P = .045, and P = .005, respectively). Conclusion: Random forest models built from 68Ga-PSMA-11 PET-based radiomics features proved useful for accurate prediction of invisible intraprostatic lesions on 68Ga-PSMA-11 PET in patients with primary prostate cancer and showed better diagnostic performance than PSAD.
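
A schematic sketch of the modelling protocol follows, with synthetic placeholder features (the real study used selected radiomics features extracted from standard and delayed 68Ga-PSMA-11 PET images); it shows only the shape of the evaluation, not the reported AUCs.

# Sketch: a random forest on radiomics-style features, with AUC estimated by
# 10-fold cross-validation on the training institution and then on an external test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(64, 20)), rng.integers(0, 2, size=64)   # institution 1 (training)
X_test,  y_test  = rng.normal(size=(36, 20)), rng.integers(0, 2, size=36)   # institution 2 (external test)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
cv_auc = cross_val_score(rf, X_train, y_train, cv=10, scoring="roc_auc").mean()

rf.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print(f"10-fold CV AUC: {cv_auc:.3f}   external test AUC: {test_auc:.3f}")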

