Predictor Selection for Bacterial Vaginosis Diagnosis Using Decision Tree and Relief Algorithms

2020 ◽  
Vol 10 (9) ◽  
pp. 3291
Author(s):  
Jesús F. Pérez-Gómez ◽  
Juana Canul-Reich ◽  
José Hernández-Torruco ◽  
Betania Hernández-Ocaña

Requiring only a few relevant patient characteristics to diagnose bacterial vaginosis is highly useful for physicians because it reduces the time needed to collect the data. This would result in a dataset of patients who can be diagnosed more accurately using only a subset of informative or relevant features rather than the entire feature set. As such, this is a feature selection (FS) problem. In this work, decision tree and Relief algorithms were used as feature selectors. Experiments were conducted on a real bacterial vaginosis dataset with 396 instances and 252 features/attributes, obtained from universities located in Baltimore and Atlanta. The FS algorithms produced feature rankings, from which the top fifteen features formed a new dataset that was used as input for both support vector machine (SVM) and logistic regression (LR) classifiers. For performance evaluation, averages over 30 runs of 10-fold cross-validation were reported, with balanced accuracy, sensitivity, and specificity as performance measures. Performance using all features was compared against performance using only the top fifteen. The top-ranked attributes were similar to those reported in the literature. This study is part of ongoing research investigating a range of feature selection and classification methods.
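The evaluation measures named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `top_k_features` and `binary_metrics`, the label encoding (1 for a positive diagnosis), and the toy data are assumptions made only for illustration.

```python
def top_k_features(scores, k=15):
    """Return the indices of the k highest-scoring features (hypothetical helper)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def binary_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, and balanced accuracy for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class that is absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Balanced accuracy is the mean of the two per-class recall values.
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Toy example: 3 positive and 5 negative cases (made-up labels).
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
selected = top_k_features([0.1, 0.9, 0.5], k=2)
```

Balanced accuracy is a natural companion to sensitivity and specificity when classes are unequal in size, because it weights each class equally regardless of how many instances it has.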

2010 ◽  
Vol 22 (02) ◽  
pp. 119-125 ◽  
Author(s):  
Hui-Huang Hsu ◽  
Cheng-Wei Hsieh

Determining the structure of a protein is not an easy task and usually involves a time-consuming and costly process in the wet lab. Using computational methods to predict a protein's tertiary structure from its primary structure (the amino acid sequence) is therefore desirable. Disordered regions are segments of a protein that do not have a fixed conformation, which makes structure prediction harder. These disordered regions are also functionally important for a protein. In this research, we aim to identify such regions with a focus on selecting a proper feature set. Three methods, namely F-score, information gain (IG), and k-medoids clustering, are used for feature selection. The support vector machine (SVM) is then used for classification. The results show that classification accuracy can be raised with a smaller feature set. Feature selection with k-medoids clustering reduces the number of features from 440 to 150 and improves the accuracy from 84.66% to 86.81% in five-fold cross-validation. It also performs more stably than F-score and IG.
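As an illustration of filter-style ranking, the sketch below computes a per-feature F-score. The abstract does not define its F-score, so this assumes the two-class definition commonly paired with SVMs (between-class separation of the means divided by the sum of within-class sample variances). The function names and toy data are hypothetical.

```python
from statistics import mean, variance


def f_score(pos_values, neg_values):
    """F-score of one feature; each class needs at least two samples."""
    all_mean = mean(list(pos_values) + list(neg_values))
    pos_mean, neg_mean = mean(pos_values), mean(neg_values)
    # Numerator: how far each class mean lies from the overall mean.
    numerator = (pos_mean - all_mean) ** 2 + (neg_mean - all_mean) ** 2
    # Denominator: sum of the sample variances (n - 1 divisor) within each class.
    denominator = variance(pos_values) + variance(neg_values)
    return numerator / denominator if denominator > 0 else 0.0


def rank_features(pos_rows, neg_rows):
    """Return feature indices sorted by descending F-score, plus the scores."""
    n_features = len(pos_rows[0])
    scores = [
        f_score([row[j] for row in pos_rows], [row[j] for row in neg_rows])
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data: feature 0 separates the classes, feature 1 does not.
pos = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0]]
neg = [[7.0, 4.0], [8.0, 5.0], [9.0, 3.0]]
order, scores = rank_features(pos, neg)
```

A larger score suggests a more discriminative feature, but the score treats each feature in isolation and so cannot detect redundancy between features.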


2004 ◽  
Vol 13 (04) ◽  
pp. 791-800 ◽  
Author(s):  
HOLGER FRÖHLICH ◽  
OLIVIER CHAPELLE ◽  
BERNHARD SCHÖLKOPF

The problem of feature selection is a difficult combinatorial task in Machine Learning and of high practical relevance, e.g. in bioinformatics. Genetic Algorithms (GAs) offer a natural way to solve this problem. In this paper we present a special Genetic Algorithm, which especially takes into account the existing bounds on the generalization error for Support Vector Machines (SVMs). This new approach is compared to the traditional method of performing cross-validation and to other existing algorithms for feature selection.


2020 ◽  
Vol 23 (65) ◽  
pp. 100-114
Author(s):  
Supoj Hengpraprohm ◽  
Suwimol Jungjit

For breast cancer data classification, we propose an ensemble filter feature selection approach named ‘EnSNR’. Entropy and SNR evaluation functions are used to find the features (genes) for the EnSNR subset. A Genetic Algorithm (GA) generates the classification ‘model’. The efficiency of the ‘model’ is validated using 10-fold cross-validation resampling. The microarray dataset used in our experiments contains 50,739 genes for each of 32 patients. Using the proposed ‘EnSNR’ subset of features improves prediction accuracy, reduces the number of irrelevant features (genes), and also yields a small saving in computer processing time.


Molecules ◽  
2018 ◽  
Vol 23 (8) ◽  
pp. 2000 ◽  
Author(s):  
Jiu-Xin Tan ◽  
Fu-Ying Dao ◽  
Hao Lv ◽  
Peng-Mian Feng ◽  
Hui Ding

Accurate identification of phage virion proteins is not only a key step toward understanding their function but also helps in understanding the lysis mechanism of the bacterial cell. Because traditional experimental methods for identifying phage virion proteins are time-consuming and costly, there is an urgent need for machine learning methods that identify them accurately and efficiently. In this work, a support vector machine (SVM) based method was proposed by mixing multiple sets of optimal g-gap dipeptide compositions. Analysis of variance (ANOVA) and minimal-redundancy-maximal-relevance (mRMR), each combined with incremental feature selection (IFS), were applied to single out the optimal feature set. In the five-fold cross-validation test, the proposed method achieved an overall accuracy of 87.95%. We believe that the proposed method will become an efficient and powerful tool for scientists studying phage virion proteins.
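The incremental selection step can be sketched in Python as follows, under stated assumptions. The ranked list would come from a filter such as ANOVA or mRMR, and `evaluate` stands in for a cross-validated classifier accuracy. Both names, and the toy scores, are hypothetical and not taken from the paper.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Grow the feature set one ranked feature at a time and keep the best prefix.

    ranked_features: features ordered from most to least relevant.
    evaluate: callable taking a list of features and returning a score
              (assumed here to be something like cross-validated accuracy).
    """
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked_features) + 1):
        subset = list(ranked_features[:k])
        score = evaluate(subset)
        # Strict comparison keeps the smallest subset among ties.
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy evaluator: accuracy peaks when two features are used (made-up numbers).
toy_scores = {1: 0.70, 2: 0.85, 3: 0.80, 4: 0.78}
best_subset, best_score = incremental_feature_selection(
    ["a", "b", "c", "d"], lambda subset: toy_scores[len(subset)]
)
```

Because every prefix of the ranking is scored, the cost is one model evaluation per feature, which is far cheaper than searching all subsets.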


Author(s):  
Fatmawati Fatmawati ◽  
Muhammad Affandes

Abstract – The iRaise Helpdesk Facebook Group is a social media service used by PTIPD UIN Suska Riau as a customer service channel for its academic system. Because the academic system recently migrated from its predecessor, SIMAK, to iRaise, problems still arise and lead to complaints from users. PTIPD still processes complaint data manually using Microsoft Word and Excel. This study therefore classifies iRaise system problems into four classes: login, course registration (krs), grades (nilai), and personal, using a Support Vector Machine (SVM) with an RBF kernel. The dataset contains 1040 complaints. Testing was carried out in RapidMiner using 10-fold cross-validation and measured with a confusion matrix. The experiments show a highest accuracy of 95.67%, obtained without feature selection at C=2 and [...]. Keywords: confusion matrix, cross validation, iRaise, complaints, classification, RapidMiner, support vector machine.


2021 ◽  
Vol 2 (2) ◽  
pp. 112-122
Author(s):  
Novanto Yudistira ◽  
Aldi Fianda Putra

A heart attack, known medically as myocardial infarction, is a very serious heart condition. Detection in this study is based on the complications suffered by patients. The algorithms evaluated are Naive Bayes, Decision Tree, and Support Vector Machine. However, they cannot be evaluated right away. Before the three algorithms are evaluated, the dataset is repaired, because it contains missing values. The repair is done by imputation, in which each missing value is estimated from the mean of the cluster members in the same class. After imputation, the data are normalized with the MinMax method so that the range of the features, especially continuous numeric ones, is not too large. Once this preprocessing is complete, the algorithms can be evaluated using K-fold cross-validation. A further problem then emerged: the training data were imbalanced. Oversampling was therefore applied to balance the data. After balancing, the evaluation was repeated, and the most suitable algorithm for classifying data such as the Myocardial Infarction Complications dataset was found to be Decision Tree with 98% accuracy, followed by Support Vector Machine with 91% accuracy and Naïve Bayes with the lowest accuracy, 49%.
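Two of the preprocessing steps mentioned in this abstract, MinMax normalization and oversampling, can be sketched in plain Python. The abstract does not say which oversampling variant was used, so random duplication of minority-class rows is an assumption here, as are the function names and the toy data.

```python
import random
from collections import Counter


def min_max_scale(values):
    """Rescale one numeric feature to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size.

    Assumption: simple random oversampling; the study may have used another variant.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


scaled = min_max_scale([2, 4, 6])
balanced_rows, balanced_labels = random_oversample(
    [[0.1], [0.2], [0.3], [0.9]], ["no", "no", "no", "yes"]
)
```

When oversampling is combined with cross-validation, it is generally safer to oversample only the training portion of each fold, since duplicated rows that land in a test fold can inflate the measured accuracy.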


2018 ◽  
Vol 1 (1) ◽  
pp. 120-130 ◽  
Author(s):  
Chunxiang Qian ◽  
Wence Kang ◽  
Hao Ling ◽  
Hua Dong ◽  
Chengyao Liang ◽  
...  

A Support Vector Machine (SVM) model optimized by K-fold cross-validation was built to predict and evaluate the degradation of concrete strength in a complicated marine environment. Several other models, such as an Artificial Neural Network (ANN) and a Decision Tree (DT), were also built and compared with the SVM to determine which could make the most accurate predictions. Both material and environmental factors that influence the results were considered. The material factors mainly involved the original concrete strength and the amount of cement replaced by fly ash and slag. The environmental factors consisted of the concentrations of Mg²⁺, SO₄²⁻, and Cl⁻, the temperature, and the exposure time. The prediction results indicated that the optimized SVM model performed better than the other models in predicting concrete strength. Based on the SVM model, a simulation method of variable limitation was used to determine the sensitivity of the various factors and the degree to which they influence the degradation of concrete strength.
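The K-fold splitting that underlies this kind of model tuning can be sketched in Python as below. This shows only the index bookkeeping, not the SVM itself. The function name `k_fold_indices`, the shuffling seed, and the sample count are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=10, k=3))
```

In a tuning loop, each candidate hyperparameter setting would be trained on every `train` split and scored on the matching `test` split, and the setting with the best average score would be kept.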

