Expert cancer model using supervised algorithms with a LASSO selection approach

Breast cancer is currently one of the most critical contributors to mortality in the medical field. Every year, a large number of men and women die of cancer because of the lack of early diagnosis systems and proper treatment. To tackle the issue, various data mining approaches have been analyzed to build an effective model that helps identify the different stages of deadly cancers. The study proposes an early cancer disease model based on five supervised algorithms: logistic regression (henceforth LR), decision tree (henceforth DT), random forest (henceforth RF), support vector machine (henceforth SVM), and K-nearest neighbor (henceforth KNN). After appropriate preprocessing of the dataset, the least absolute shrinkage and selection operator (LASSO) was used for feature selection (FS) with a 10-fold cross-validation (CV) approach. Employing LASSO with 10-fold cross-validation is a novel step introduced in this research. Afterwards, different performance evaluation metrics were measured to assess the predictions of the proposed algorithms. The results indicated that the RF classifier achieved the top accuracy, approximately 99.41%, with the integration of LASSO. Finally, a comprehensive comparison was carried out on the Wisconsin breast cancer (diagnostic) dataset (WBCD) together with some current works containing all features.
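
The abstract does not say how LASSO was solved or which penalty the 10-fold CV selected. As a minimal sketch of why LASSO acts as a feature selector, the following pure-Python coordinate descent shows an L1 penalty driving the coefficient of an uninformative feature to exactly zero. The toy data, the function names, and the fixed `penalty` value are all illustrative assumptions, not the authors' setup; in practice the penalty would be chosen by cross-validation.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero; anything inside [-penalty, penalty] becomes 0.0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100):
    """Minimise (1/2n)*||y - Xw||^2 + penalty*||w||_1 by cyclic coordinate descent.

    X is a list of rows whose columns are assumed centred; y is assumed centred.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0    # correlation of feature j with the residual excluding j
            scale = 0.0  # sum of squares of feature j
            for i in range(n):
                partial = y[i] - sum(w[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * partial
                scale += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, penalty) / (scale / n)
    return w


# Hypothetical toy data: the target depends only on the first feature.
x1 = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
x2 = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]   # uncorrelated with x1 and with y
X = [[a, b] for a, b in zip(x1, x2)]
y = [2.0 * a for a in x1]

weights = lasso_coordinate_descent(X, y, penalty=0.1)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print(weights, selected)
```

Only the features whose coefficients stay non-zero would be passed on to the downstream classifiers; a larger penalty keeps fewer features.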

Comparison of Accuracy and Processing Time of the K-NN and SVM Algorithms in Twitter Sentiment Analysis

Jurnal Informatika ◽

10.31311/ji.v6i2.5129 ◽

2019 ◽

Vol 6 (2) ◽

pp. 226-235

Author(s):

Muhammad Rangga Aziz Nasution ◽

Mardhiya Hayaty

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Unsupervised Learning ◽

Supervised Learning ◽

Cross Validation ◽

Nearest Neighbor ◽

Support Vector ◽

K Nearest Neighbor ◽

Fold Cross Validation

Machine learning, a branch of computer science, has become a trend in recent years. Machine learning uses data and algorithms to build models from the patterns in a dataset. It also studies how the resulting model can predict outputs based on those patterns. Two types of machine learning methods can be used for sentiment analysis: supervised learning and unsupervised learning. This study compares two supervised classification algorithms, K-Nearest Neighbor and Support Vector Machine, by building a model with each algorithm on sentiment text. The comparison aims to determine which algorithm performs better in terms of accuracy and processing time. In terms of accuracy, Support Vector Machine is superior, scoring 89.70% without K-Fold Cross Validation and 88.76% with K-Fold Cross Validation. In terms of processing time, K-Nearest Neighbor is superior, taking 0.0160 s without K-Fold Cross Validation and 0.1505 s with K-Fold Cross Validation.

Bayes Classifier and Support Vector Machine in Classifying Final Project Titles of Students in the PTIK Study Program at UNJ

PINTER Jurnal Pendidikan Teknik Informatika dan Komputer ◽

10.21009/pinter.3.1.9 ◽

2019 ◽

Vol 3 (1) ◽

pp. 54-62

Author(s):

Razi Aziz Syahputro ◽

Widodo ◽

Hamidillah Ajie

Keyword(s):

Support Vector Machine ◽

Cross Validation ◽

Nearest Neighbor ◽

Confusion Matrix ◽

Vector Space Model ◽

Support Vector ◽

Bayes Classifier ◽

K Nearest Neighbor ◽

Space Model ◽

Fold Cross Validation

This study was motivated by the need for a classification system to help the Department of Electrical Engineering, and the PTIK Study Program in particular, classify thesis titles by area of specialization. Before building the system, several existing classification algorithms had to be considered, so this study selected 3 of the top 10 algorithms identified by ICDM in 2006. Classifying short text documents such as student thesis titles is harder than classifying long text documents, because the fewer the words, the harder the classification. The aim of this study is therefore to determine the most effective algorithm for classifying thesis titles. The study consists of several stages: data collection, data grouping through a questionnaire completed by expert lecturers, text pre-processing, term weighting using the vector space model and tf-idf, evaluation with k-fold cross validation, classification using k-nearest neighbor, the naïve Bayes classifier, and support vector machine, and analysis with a confusion matrix. The experiments used 266 thesis titles of PTIK UNJ students from the 2010-2013 cohorts, with the most recent data coming from thesis defenses in semester 105 (odd semester 2016/2017). Among these algorithms, the most efficient was support vector machine, with an accuracy of 82% over 10 trials.
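
The abstract names tf-idf weighting in a vector space model but does not give the exact variant used. A minimal sketch, assuming the common form tf = term count / document length and idf = ln(N / document frequency), might look as follows. The tokenised "titles" and the function name are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenised titles; "judul" occurs in every document.
docs = [["judul", "svm"], ["judul", "knn"], ["judul", "svm", "svm"]]
for vector in tf_idf(docs):
    print(vector)
```

A term that appears in every title receives weight zero, which is why very short texts with shared vocabulary leave a classifier little to work with.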

A hybrid cost-sensitive ensemble for heart disease prediction

10.21203/rs.2.22946/v1 ◽

2020 ◽

Author(s):

Zhenya Qi ◽

Zuoru Zhang

Keyword(s):

Heart Disease ◽

Cross Validation ◽

Nearest Neighbor ◽

Support Vector ◽

K Nearest Neighbor ◽

Misclassification Cost ◽

Proposed Model ◽

Learning Machine ◽

Fold Cross Validation ◽

Very High

Heart disease is the primary cause of morbidity and mortality in the world. It includes numerous problems and symptoms. The diagnosis of heart disease is difficult because there are too many factors to analyze. Moreover, the misclassification cost can be very high. In this paper, I first propose a cost-sensitive ensemble model to improve the accuracy of diagnosis and reduce the misclassification cost. The proposed model contains five heterogeneous classifiers: random forest, logistic regression, support vector machine, extreme learning machine and k-nearest neighbor. Experiments are then carried out on three datasets from the UCI machine learning repository. The highest classification accuracy of 91.74%, highest G-mean of 90.55%, highest precision of 96.11%, highest recall of 89.61% and lowest misclassification cost of 30.32% are achieved by the proposed model according to ten-fold cross validation. The results demonstrate that the performance of the proposed model is superior to those of previously reported classification techniques.
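
For readers unfamiliar with the reported metrics, the sketch below computes accuracy, precision, recall, and G-mean from a binary confusion matrix, taking G-mean as the geometric mean of recall and specificity, which is its usual definition. The counts are made up for illustration. The misclassification-cost metric is left out because the abstract does not define it.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Compute standard metrics from binary confusion-matrix counts."""
    recall = tp / (tp + fn)           # sensitivity: positives correctly found
    specificity = tn / (tn + fp)      # negatives correctly rejected
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "g_mean": g_mean,
    }


# Hypothetical counts, not taken from the paper.
print(binary_metrics(tp=40, fn=10, fp=5, tn=45))
```

G-mean drops sharply when either class is poorly recognised, so it is more informative than accuracy on imbalanced medical data.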

A Comparative Study of Mental Workload Classification Algorithms Based on EEG Signals

Jurnal Sistem Cerdas ◽

10.37396/jsc.v3i2.69 ◽

2020 ◽

Vol 3 (2) ◽

pp. 133-143

Author(s):

Dessy Kusumaningrum ◽

Elly Matul Imah

Keyword(s):

Random Forest ◽

Cross Validation ◽

Nearest Neighbor ◽

Mental Workload ◽

Principal Component ◽

Support Vector ◽

Multi Layer Perceptron ◽

K Nearest Neighbor ◽

Electroencephalogram Eeg ◽

Fold Cross Validation

A person's psychological and physical condition can affect the thinking process. When an individual is fatigued, productivity and thinking ability can decline, giving rise to mental workload. Workload must be balanced against the individual's abilities and limitations. Excessive mental workload is harmful because it reduces work productivity. A dedicated device that can be used to determine an individual's level of mental workload is the electroencephalogram (EEG), which measures electrical potential signals from the brain. The dataset used in this study is STEW: Simultaneous Task EEG Dataset, with 45 subjects. This study compares the Random Forest, K-Nearest Neighbor (KNN), Multi-Layer Perceptron (MLP), and Support Vector Machine (SVM) algorithms for classifying mental workload from EEG signals. The comparison aims to determine the best classification algorithm in terms of accuracy and memory usage during classification. The dataset went through several stages, including data pre-processing, feature extraction, and classification. Pre-processing split the data into several chunks. Principal Component Analysis (PCA) was applied for feature extraction. Classification used a k-fold cross validation approach. The results show that KNN is the best algorithm in terms of accuracy, Random Forest is the best in terms of model-building time, and MLP is the best in terms of memory usage.

Performance Comparison of Data Mining Algorithms on a Student Alcohol Consumption Dataset

Khazanah Informatika Jurnal Ilmu Komputer dan Informatika ◽

10.23917/khif.v4i2.7061 ◽

2018 ◽

Vol 4 (2) ◽

pp. 98

Author(s):

Noviyanti Sagala ◽

Hendrik Tampubolon

Keyword(s):

Data Mining ◽

Cross Validation ◽

Nearest Neighbor ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

K Nearest Neighbor ◽

Gain Ratio ◽

Feature Correlation ◽

Fold Cross Validation

Data mining extracts knowledge from large collections of data. This study aims to apply data mining algorithms and analyze their performance in predicting alcohol consumption among secondary-school students, and to analyze the related factors. The stages are data pre-processing, feature selection, classification, and model evaluation. In the pre-processing stage, several features are transformed into a form suitable for classification. Next, the Gain Ratio and Feature Correlation-Based Filter (FCBF) algorithms are used to select relevant and important features for the classification stage. Decision Tree C5.0, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Naive Bayes (NB) are run on the selected feature groups. The accuracy of the resulting models is evaluated using 10-fold Cross-Validation (CV). The results show that the classification model built with Naïve Bayes has the highest accuracy when using the 5 best features from Gain Ratio. In addition, feature selection generally improves the performance of all classifiers. Further testing on the same and on different data is needed to gain a deeper picture of the performance of the algorithms used.

Presentation of Novel Architecture for Diagnosis and Identifying Breast Cancer Location Based on Ultrasound Images Using Machine Learning

Diagnostics ◽

10.3390/diagnostics11101870 ◽

2021 ◽

Vol 11 (10) ◽

pp. 1870

Author(s):

Yaghoub Pourasad ◽

Esmaeil Zarouri ◽

Mohammad Salemizadeh Parizi ◽

Amin Salih Mohammed

Keyword(s):

Breast Cancer ◽

High Performance ◽

Nearest Neighbor ◽

Support Vector ◽

Ultrasound Images ◽

Morphological Operations ◽

K Nearest Neighbor ◽

Diagnose Breast Cancer ◽

Premature Deaths ◽

Infected Area

Breast cancer is one of the main causes of death among women worldwide. Early detection of this disease helps reduce the number of premature deaths. This research aims to design a method for identifying and diagnosing breast tumors based on ultrasound images. For this purpose, six techniques have been performed to detect and segment ultrasound images. Features of images are extracted using the fractal method. Moreover, k-nearest neighbor, support vector machine, decision tree, and Naïve Bayes classification techniques are used to classify images. Then, a convolutional neural network (CNN) architecture is designed to classify breast cancer directly from ultrasound images. The presented model reaches an accuracy of 99.8% on the training set. On the test set, the diagnosis is associated with 88.5% sensitivity. Based on the findings of this study, it can be concluded that the proposed high-potential CNN algorithm can be used to diagnose breast cancer from ultrasound images. The second presented CNN model can identify the original location of the tumor. The results show 92% of the images in the high-performance region with an AUC above 0.6. The proposed model can identify the tumor’s location and volume by morphological operations as a post-processing algorithm. These findings can also be used to monitor patients and prevent the growth of the infected area.

Heartbeat Detection by Laser Doppler Vibrometry and Machine Learning

Sensors ◽

10.3390/s20185362 ◽

2020 ◽

Vol 20 (18) ◽

pp. 5362 ◽

Cited By ~ 1

Author(s):

Luca Antognoli ◽

Sara Moccia ◽

Lucia Migliorelli ◽

Sara Casaccia ◽

Lorenzo Scalise ◽

...

Keyword(s):

Machine Learning ◽

Signal Analysis ◽

Cross Validation ◽

Nearest Neighbor ◽

Laser Doppler Vibrometer ◽

Laser Doppler ◽

Support Vector ◽

K Nearest Neighbor ◽

Real World Application ◽

Testing Protocol

Background: Heartbeat detection is a crucial step in several clinical fields. Laser Doppler Vibrometer (LDV) is a promising non-contact measurement for heartbeat detection. The aim of this work is to assess whether machine learning can be used for detecting heartbeat from the carotid LDV signal. Methods: The performances of Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF) and K-Nearest Neighbor (KNN) were compared using the leave-one-subject-out cross-validation as the testing protocol in an LDV dataset collected from 28 subjects. The classification was conducted on LDV signal windows, which were labeled as beat, if containing a beat, or no-beat, otherwise. The labeling procedure was performed using electrocardiography as the gold standard. Results: For the beat class, the f1-score (f1) values were 0.93, 0.93, 0.95, 0.96 for RF, DT, KNN and SVM, respectively. No statistical differences were found between the classifiers. When testing the SVM on the full-length (10 min long) LDV signals, to simulate a real-world application, we achieved a median macro-f1 of 0.76. Conclusions: Using machine learning for heartbeat detection from carotid LDV signals showed encouraging results, representing a promising step in the field of contactless cardiovascular signal analysis.
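
The leave-one-subject-out protocol named in the Methods can be sketched as a small splitter: each fold holds out every signal window from one subject and trains on the rest, so no subject contributes to both sides of a split. The function name and the subject labels below are illustrative assumptions, not the authors' code.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject.

    subject_ids[i] is the subject that sample i (e.g. one signal window) came from.
    """
    subjects = []
    for subject in subject_ids:
        if subject not in subjects:   # keep first-seen order
            subjects.append(subject)
    for held_out in subjects:
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical labels: five windows recorded from three subjects.
window_subjects = ["s1", "s1", "s2", "s3", "s3"]
for held_out, train_idx, test_idx in leave_one_subject_out(window_subjects):
    print(held_out, train_idx, test_idx)
```

With 28 subjects this would give 28 folds. Splitting by subject rather than by window avoids scoring a model on windows from a person it has already seen.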

Comparative analysis of breast cancer detection in mammograms and thermograms

Biomedical Engineering / Biomedizinische Technik ◽

10.1515/bmt-2014-0047 ◽

2015 ◽

Vol 60 (1) ◽

Cited By ~ 7

Author(s):

Marina Milosevic ◽

Dragan Jankovic ◽

Aleksandar Peulic

Keyword(s):

Nearest Neighbor ◽

Region Of Interest ◽

Texture Features ◽

Classification Performance ◽

Support Vector ◽

K Nearest Neighbor ◽

Characteristic Analysis ◽

Analysis Society ◽

Fold Cross Validation ◽

Neighbor Classifier

In this paper, we present a system based on feature extraction techniques for detecting abnormal patterns in digital mammograms and thermograms. A comparative study of texture-analysis methods is performed for three image groups: mammograms from the Mammographic Image Analysis Society mammographic database; digital mammograms from the local database; and thermography images of the breast. Also, we present a procedure for the automatic separation of the breast region from the mammograms. Computed features based on gray-level co-occurrence matrices are used to evaluate the effectiveness of textural information possessed by mass regions. A total of 20 texture features are extracted from the region of interest. The ability of the feature set to differentiate abnormal from normal tissue is investigated using a support vector machine classifier, a Naive Bayes classifier and a K-Nearest Neighbor classifier. To evaluate the classification performance, five-fold cross-validation and receiver operating characteristic analysis were performed.
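
To make the gray-level co-occurrence matrix (GLCM) idea concrete, the sketch below builds a normalised GLCM for one pixel offset and derives two common texture features, contrast and angular second moment. The tiny image, the single horizontal offset, and the choice of these two features are assumptions for illustration. The abstract does not list which 20 features or offsets were used.

```python
def glcm(image, levels, d_row=0, d_col=1):
    """Normalised co-occurrence matrix for the offset (d_row, d_col).

    Entry [a][b] is the fraction of pixel pairs in which a pixel with gray
    level a has a pixel with gray level b at the given offset.
    """
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    return [[value / total for value in row] for row in counts]


def contrast(p):
    """Weight each co-occurrence by the squared gray-level difference."""
    return sum(p[i][j] * (i - j) ** 2 for i in range(len(p)) for j in range(len(p)))


def angular_second_moment(p):
    """Sum of squared entries; higher for more uniform texture."""
    return sum(value ** 2 for row in p for value in row)


# Hypothetical 3x3 region of interest with gray levels 0..2.
roi = [[0, 0, 1],
       [1, 2, 2],
       [2, 2, 0]]
p = glcm(roi, levels=3)
print(contrast(p), angular_second_moment(p))
```

Repeating this over several offsets and features would produce a fixed-length feature vector per region, suitable as classifier input.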

CLASSIFICATION OF PREMIUM PAYMENT STATUS USING THE NEIGHBOR WEIGHTED K-NEAREST NEIGHBOR (NWKNN) ALGORITHM (CASE STUDY: PT. BUMIPUTERA KOTA SAMARINDA)

VARIANCE : Journal of Statistics and Its Applications ◽

10.30598/variancevol1iss2page56-63 ◽

2020 ◽

Vol 1 (2) ◽

pp. 56-63

Author(s):

Grassella Gunsyang ◽

Ika Purnamasari ◽

Fidia Deny Tisna Amijaya

Keyword(s):

Cross Validation ◽

Nearest Neighbor ◽

K Nearest Neighbor ◽

Fold Cross Validation

The Neighbor Weighted K-Nearest Neighbor (NWKNN) algorithm is an extension of the K-Nearest Neighbor (KNN) algorithm that assigns a weight to each class to be classified. This study discusses classification using the NWKNN algorithm applied to premium payment status data. The aim is to find the optimal exponent value (E) and neighborhood value (K), as well as the accuracy of classifying premium payment status data at PT. Bumiputera Kota Samarinda. The stages of the study are determining the values of E and K using k-fold cross validation, computing Euclidean distances, computing the weight and score of each class, taking the largest score to determine the classification result, and then computing the classification accuracy. The results show that the optimal values for classifying premium payment status at PT. Bumiputera Kota Samarinda using NWKNN are K=3 and E=6, with an accuracy of 75%.
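
The stages listed in the abstract (Euclidean distance, class weights, class scores, largest score wins) can be sketched as follows. The class weight uses the form commonly given for NWKNN, 1 / (class size / smallest class size)^(1/E). Turning a distance into a similarity as 1 / (1 + distance) is an assumption of this sketch, since the abstract does not say how scores are built from distances. The data and the choice E=2 are illustrative only.

```python
import math
from collections import Counter


def nwknn_predict(train_points, train_labels, query, k, exponent):
    """Classify query with class-weighted K-nearest-neighbour voting."""
    class_sizes = Counter(train_labels)
    smallest = min(class_sizes.values())
    # Larger classes receive smaller weights; the smallest class has weight 1.
    class_weight = {
        label: 1.0 / (size / smallest) ** (1.0 / exponent)
        for label, size in class_sizes.items()
    }
    # K training points with the smallest Euclidean distance to the query.
    neighbours = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:k]
    scores = {label: 0.0 for label in class_sizes}
    for distance, label in neighbours:
        # Assumed similarity: 1 / (1 + distance).
        scores[label] += class_weight[label] / (1.0 + distance)
    return max(scores, key=scores.get)


# Hypothetical imbalanced data: eight "A" points, two "B" points on a line.
points = [(float(v),) for v in range(10)]
labels = ["A"] * 8 + ["B"] * 2
print(nwknn_predict(points, labels, query=(7.4,), k=3, exponent=2))
```

In this toy case two of the three nearest neighbours belong to the majority class, so plain majority voting would answer "A", while the class weights tip the decision to the minority class "B". A larger exponent weakens that correction, which is why E is tuned together with K.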

Phishing Website Detection Using Machine Learning Classifiers Optimized by Feature Selection

Traitement du signal ◽

10.18280/ts.370403 ◽

2020 ◽

Vol 37 (4) ◽

pp. 563-569

Author(s):

Dželila Mehanović ◽

Jasmin Kevrić

Keyword(s):

Feature Selection ◽

Random Forest ◽

Cross Validation ◽

Nearest Neighbor ◽

Security Threats ◽

Selection Methods ◽

K Nearest Neighbor ◽

Machine Learning Classifiers ◽

Time To Build ◽

Fold Cross Validation

Security is one of the most topical issues in the online world. Lists of security threats are constantly updated. One of those threats is phishing websites. In this work, we address the problem of phishing website classification. Three classifiers were used: K-Nearest Neighbor, Decision Tree and Random Forest, with the feature selection methods from Weka. The achieved accuracy was 100% and the number of features was reduced to seven. Moreover, reducing the number of features also reduced the time to build the models. The time for Random Forest decreased from the initial 2.88s and 3.05s for percentage split and 10-fold cross validation to 0.02s and 0.16s respectively.
