Combining Fisher Score Feature Selection and Principal Component Analysis (PCA) for Cervix Dysplasia Classification

Examining Pap smear images is an important step in the early diagnosis of cervical dysplasia, but it requires substantial resources. Machine learning can address this problem; however, its accuracy depends on the features used. Only relevant and discriminative features yield accurate classification results. This work combines Fisher Score (FScore) feature selection with Principal Component Analysis (PCA). First, FScore selects relevant features based on their ranking scores. PCA then transforms the candidate features into a new, uncorrelated dataset. A backpropagation artificial neural network is used to evaluate the performance of the FScore and PCA combination. The model is evaluated with 5-fold cross-validation, and the combination is compared with a model built on the original features and a model built on the FScore-selected features. Experimental results show that the combination of FScore and PCA produced the best performance (accuracy 0.964±0.006, sensitivity 0.990±0.005, specificity 0.889±0.009). In terms of computation time, the combination was relatively fast. This work shows that applying feature selection followed by feature extraction can improve classification performance in a relatively short time.
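The pipeline described in this abstract (Fisher Score ranking, then PCA, then a backpropagation network scored with 5-fold cross-validation) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the synthetic data from `make_classification`, the values `k=10` and `n_components=5`, and the network size are all assumptions, and `fisher_score` is a hypothetical helper written for this example using one common definition of the score (between-class scatter divided by within-class scatter, per feature).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fisher_score(X, y, eps=1e-12):
    """Per-feature Fisher score (hypothetical helper, one common definition)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + eps)  # eps guards against zero within-class variance


def build_model(k=10, n_components=5):
    # Selection and PCA sit inside the pipeline, so every CV fold refits them
    return make_pipeline(
        StandardScaler(),
        SelectKBest(score_func=fisher_score, k=k),
        PCA(n_components=n_components),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    )


# Synthetic stand-in for the Pap smear features (assumption, not the paper's data)
X, y = make_classification(n_samples=300, n_features=20, n_informative=6, random_state=0)
scores = cross_val_score(build_model(), X, y, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Keeping the selector and PCA inside the pipeline is a design choice of this sketch: it prevents the held-out fold from influencing which features are ranked highest.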

Multivariate Analysis and Machine Learning for Ripeness Classification of Cape Gooseberry Fruits

Processes ◽ 10.3390/pr7120928 ◽ 2019 ◽ Vol 7 (12) ◽ pp. 928 ◽ Cited By ~ 2

Author(s): Miguel De-la-Torre ◽ Omar Zatarain ◽ Himer Avila-George ◽ Mirna Muñoz ◽ Jimy Oblitas ◽ ...

Keyword(s): Machine Learning ◽ Principal Component Analysis ◽ Feature Selection ◽ Principal Component ◽ Component Analysis ◽ Support Vector ◽ Color Spaces ◽ Combination Methods ◽ Fruit Samples ◽ Cape Gooseberry

This paper explores five multivariate techniques for information fusion in sorting Cape gooseberry fruits by visual ripeness: principal component analysis, linear discriminant analysis, independent component analysis, eigenvector centrality feature selection, and multi-cluster feature selection. These techniques are applied to the concatenated channels of three color spaces (9 features in total): red, green, blue (RGB); hue, saturation, value (HSV); and lightness, red/green value, blue/yellow value (L*a*b). Machine learning techniques have been reported for sorting Cape gooseberry fruits by ripeness. Classifiers such as neural networks, support vector machines, and nearest neighbors discriminate between fruit samples using different color spaces. Although the color spaces are equivalent up to a transformation, some classifiers perform better because of differences in the pixel distribution of the samples. Experimental results show that selecting and combining color channels allows classifiers to reach similar levels of accuracy; however, combination methods carry a higher computational cost. The highest accuracy was obtained using the seven-dimensional principal component analysis feature space.
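As a rough illustration of the feature construction described above, the sketch below concatenates RGB, HSV, and L*a*b values for each pixel into a 9-dimensional vector and projects it onto seven principal components. It is a minimal sketch under stated assumptions: the pixels are random values rather than fruit images, `rgb_to_lab` and `nine_channel_features` are hypothetical helpers, and the Lab conversion assumes sRGB input with a D65 white point.

```python
import colorsys
import numpy as np
from sklearn.decomposition import PCA

# sRGB -> XYZ matrix and D65 white point (assumed colour conventions)
M = np.array([[0.4124564, 0.3575761, 0.1804375],
              [0.2126729, 0.7151522, 0.0721750],
              [0.0193339, 0.1191920, 0.9503041]])
WHITE = np.array([0.95047, 1.0, 1.08883])


def rgb_to_lab(rgb):
    """Convert an (n, 3) array of sRGB values in [0, 1] to L*a*b*."""
    rgb = np.asarray(rgb, dtype=float)
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    xyz = linear @ M.T / WHITE
    delta = 6 / 29
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4 / 29)
    L = 116 * f[:, 1] - 16
    a = 500 * (f[:, 0] - f[:, 1])
    b = 200 * (f[:, 1] - f[:, 2])
    return np.column_stack([L, a, b])


def nine_channel_features(rgb):
    """Stack RGB, HSV and L*a*b* channels into one (n, 9) feature matrix."""
    rgb = np.asarray(rgb, dtype=float)
    hsv = np.array([colorsys.rgb_to_hsv(*p) for p in rgb])
    return np.hstack([rgb, hsv, rgb_to_lab(rgb)])


rng = np.random.default_rng(0)
pixels = rng.random((200, 3))             # random stand-in for fruit pixels
features = nine_channel_features(pixels)  # shape (200, 9)

# Standardise so channels with large ranges (e.g. L*) do not dominate the PCA
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
projected = PCA(n_components=7).fit_transform(scaled)
print(features.shape, projected.shape)
```

The seven-component projection mirrors the dimensionality reported in the abstract; any classifier could then be trained on `projected` in place of the raw channels.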

IPCARF: improving lncRNA-disease association prediction using incremental principal component analysis feature selection and a random forest classifier

BMC Bioinformatics ◽ 10.1186/s12859-021-04104-9 ◽ 2021 ◽ Vol 22 (1)

Author(s): Rong Zhu ◽ Yong Wang ◽ Jin-Xing Liu ◽ Ling-Yun Dai

Keyword(s): Principal Component Analysis ◽ Random Forest ◽ Cross Validation ◽ Search Algorithm ◽ Principal Component ◽ Component Analysis ◽ Biological Data ◽ Learning Technology ◽ Disease Associations ◽ Fold Cross Validation

Background: Identifying lncRNA-disease associations not only helps to better comprehend the underlying mechanisms of various human diseases at the lncRNA level but also speeds up the identification of potential biomarkers for disease diagnoses, treatments, prognoses, and drug response predictions. However, as the amount of archived biological data continues to grow, it has become increasingly difficult to detect potential human lncRNA-disease associations from these enormous biological datasets using traditional biological experimental methods. Consequently, developing new and effective computational methods to predict potential human lncRNA-disease associations is essential.
Results: Using a combination of incremental principal component analysis (IPCA) and random forest (RF) algorithms and by integrating multiple similarity matrices, we propose a new algorithm (IPCARF) based on integrated machine learning technology for predicting lncRNA-disease associations. First, we used two different models to compute a semantic similarity matrix of diseases from a directed acyclic graph of diseases. Second, a characteristic vector for each lncRNA-disease pair is obtained by integrating disease similarity, lncRNA similarity, and Gaussian nuclear similarity. Then, the best feature subspace is obtained by applying IPCA to decrease the dimension of the original feature set. Finally, we train an RF model to predict potential lncRNA-disease associations. The experimental results show that the IPCARF algorithm effectively improves the AUC metric when predicting potential lncRNA-disease associations. Before the parameter optimization procedure, the AUC value predicted by the IPCARF algorithm under 10-fold cross-validation reached 0.8529; after selecting the optimal parameters using the grid search algorithm, the predicted AUC of the IPCARF algorithm reached 0.8611.
Conclusions: We compared IPCARF with the existing LRLSLDA, LRLSLDA-LNCSIM, TPGLDA, NPCMF, and ncPred prediction methods, which have shown excellent performance in predicting lncRNA-disease associations. The results of the 10-fold cross-validation procedures show that the predictions of the IPCARF method are better than those of the other compared methods.
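The last two steps of the method (dimension reduction with IPCA, then a random forest tuned by grid search and scored by AUC under 10-fold cross-validation) could look roughly like the scikit-learn sketch below. It is not the IPCARF code: the data are synthetic, the similarity-matrix integration is skipped entirely, and the parameter grid and `batch_size` are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import IncrementalPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for lncRNA-disease pair feature vectors (assumption)
X, y = make_classification(n_samples=300, n_features=40, n_informative=10, random_state=0)

pipeline = Pipeline([
    ("ipca", IncrementalPCA(batch_size=50)),   # fits the projection in mini-batches
    ("rf", RandomForestClassifier(random_state=0)),
])

# Small illustrative grid; the paper's actual search space is not given here
param_grid = {
    "ipca__n_components": [5, 10],
    "rf__n_estimators": [50, 100],
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean AUC over 10 folds: {search.best_score_:.4f}")
```

Tuning the number of retained components together with the forest size treats the feature subspace as a hyperparameter, which is one plausible reading of "best feature subspace" in the abstract.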

Breast Cancer Detection with Feature Selection Based on Principal Component Analysis and Random Forest

Jurnal Infortech ◽ 10.31294/infortech.v2i1.8079 ◽ 2020 ◽ Vol 2 (1) ◽ pp. 96-101

Author(s): Ahmad Fauzi ◽ Riki Supriyadi ◽ Nurlaelatul Maulidah

Keyword(s): Breast Cancer ◽ Principal Component Analysis ◽ Random Forest ◽ Cross Validation ◽ Principal Component ◽ Component Analysis ◽ Data Set ◽ Fold Cross Validation

Abstract - Screening is an early-detection effort to identify diseases or abnormalities that are not yet clinically apparent, using specific tests, examinations, or procedures. It can be used to quickly distinguish people who appear healthy but actually have an abnormality. The main objective of this study is to improve classification performance in breast cancer diagnosis by applying feature selection to several classification algorithms. The study uses the Breast Cancer Coimbra Data Set. A feature selection method based on principal component analysis is paired with several classification algorithms and methods, such as LogitBoost, Bagging, and Random Forest. 10-fold cross-validation is used as the evaluation method. The results show that the feature selection method based on principal component analysis significantly improved classification performance when paired with Random Forest and LogitBoost; Random Forest performed best, with an accuracy of 79.3103% and an AUC of 0.843. Keywords: Feature Selection, PCA, Breast Cancer, Screening, Random Forest
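The evaluation protocol in this abstract (PCA paired with several classifiers and compared under 10-fold cross-validation) can be sketched as below. Several things here are assumptions: the example uses the Wisconsin breast cancer data bundled with scikit-learn rather than the Coimbra data set, it keeps five principal components, and LogitBoost is omitted to keep the example to estimators that ship with scikit-learn. The numbers it prints are therefore not comparable with those reported in the study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Wisconsin data bundled with scikit-learn, NOT the Coimbra set used in the study
X, y = load_breast_cancer(return_X_y=True)

classifiers = {
    "Bagging": BaggingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, clf in classifiers.items():
    # PCA is refit inside every fold, so the test fold never influences it
    model = make_pipeline(StandardScaler(), PCA(n_components=5), clf)
    acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    results[name] = {"accuracy": acc, "auc": auc}
    print(f"{name}: accuracy={acc:.4f}, AUC={auc:.3f}")
```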

Classification of Observations through Combination of the Dimension Reduction and the Cluster Analysis

International Journal of Advanced Research in Computer Science and Software Engineering ◽ 10.23956/ijarcsse.v7i8.13 ◽ 2017 ◽ Vol 7 (8) ◽ pp. 30

Author(s): Hyeuk Kim

Keyword(s): Machine Learning ◽ Principal Component Analysis ◽ Cluster Analysis ◽ Unsupervised Learning ◽ Principal Component ◽ Component Analysis ◽ Baseball Players ◽ Partitioning Around Medoids ◽ Different Characteristics

Unsupervised learning divides data into several groups such that observations in the same group have similar characteristics and observations in different groups have different characteristics. In this paper, we cluster data by partitioning around medoids, which has some advantages over k-means clustering. We apply it to baseball players in the Korea Baseball League. We also apply principal component analysis to the data and plot the result using two components as axes, which lets us interpret the clusters graphically. The combination of partitioning around medoids and principal component analysis can be applied to other data, and the approach makes it easy to identify the characteristics of each group.
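A small sketch of the combination described above: cluster with partitioning around medoids, then project onto two principal components to obtain plotting coordinates. The `pam` function is a hypothetical, simplified helper (random initial medoids followed by the greedy swap phase, without the usual BUILD initialisation), and the two groups of synthetic "player statistics" are assumptions standing in for the baseball data.

```python
import numpy as np
from sklearn.decomposition import PCA


def pam(X, k, seed=0, max_iter=100):
    """Partitioning around medoids: random start, then greedy swap phase."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = [int(i) for i in rng.choice(n, size=k, replace=False)]
    cost = dist[:, medoids].min(axis=1).sum()
    for _ in range(max_iter):
        best_cost, best_medoids = cost, None
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = h              # swap medoid i for candidate h
                trial_cost = dist[:, trial].min(axis=1).sum()
                if trial_cost < best_cost:
                    best_cost, best_medoids = trial_cost, trial
        if best_medoids is None:          # no swap lowers the cost: converged
            break
        cost, medoids = best_cost, best_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return np.array(medoids), labels


# Two synthetic groups of "players" with five statistics each (assumption)
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 0.5, size=(20, 5))
group_b = rng.normal(10.0, 0.5, size=(20, 5))
stats = np.vstack([group_a, group_b])

medoids, labels = pam(stats, k=2)
coords = PCA(n_components=2).fit_transform(stats)   # 2-D axes for plotting
for cluster in range(2):
    centre = coords[labels == cluster].mean(axis=0)
    print(f"cluster {cluster}: {np.sum(labels == cluster)} points, PC centre {centre.round(2)}")
```

Because each medoid is an actual observation, the cluster centres can be read off as representative players, which is one of the practical advantages over k-means that the abstract alludes to.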

Analysis of the Bath Motion in the MM-SQC Dynamics Using Unsupervised Machine Learning Dimensionality Reduction Approaches: Principal Component Analysis

10.26434/chemrxiv.13332530 ◽ 2020

Author(s): Jiawei Peng ◽ Yu Xie ◽ Deping Hu ◽ Zhenggang Lan

Keyword(s): Machine Learning ◽ Principal Component Analysis ◽ Collective Motion ◽ Principal Component ◽ Component Analysis ◽ Nonadiabatic Dynamics ◽ Trajectory Data ◽ Unsupervised Machine Learning ◽ Physical Knowledge ◽ Vibronic Couplings

The system-plus-bath model is an important tool for understanding nonadiabatic dynamics in large molecular systems. Understanding the collective motion of a huge number of bath modes is essential to reveal their key roles in the overall dynamics. We apply principal component analysis (PCA) to investigate the bath motion, based on the massive data generated from MM-SQC (the symmetrical quasi-classical dynamics method based on the Meyer-Miller mapping Hamiltonian) simulations of the excited-state energy transfer dynamics of a Frenkel-exciton model. The PCA shows that two types of bath modes, those that display strong vibronic couplings and those whose frequencies are close to the electronic transition, are very important to the nonadiabatic dynamics. These observations are fully consistent with physical insight. The conclusion is obtained purely from the PCA of the trajectory data, without heavy reliance on pre-defined physical knowledge. The results show that PCA, one of the simplest unsupervised machine learning methods, is a powerful tool for analyzing complicated nonadiabatic dynamics in the condensed phase involving many degrees of freedom.
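The idea of letting PCA pick out the important modes from trajectory data alone can be illustrated with a toy example. The sketch below is not MM-SQC data: it builds a synthetic trajectory of 12 hypothetical "bath modes" in which two modes (indices 3 and 7, chosen arbitrarily) carry large oscillations, then checks which modes dominate the leading principal components.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_modes = 2000, 12
t = np.linspace(0.0, 50.0, n_steps)

# Toy "bath" trajectory: small noise on every mode, large motion on two modes
traj = 0.1 * rng.standard_normal((n_steps, n_modes))
traj[:, 3] += 5.0 * np.sin(1.0 * t)
traj[:, 7] += 5.0 * np.sin(2.3 * t)

# PCA via SVD of the mean-centred data matrix
centred = traj - traj.mean(axis=0)
_, singular_values, components = np.linalg.svd(centred, full_matrices=False)
explained = singular_values ** 2 / (n_steps - 1)
ratio = explained / explained.sum()

# Weight each mode's squared loading by the variance of the leading components
n_keep = 2
importance = (components[:n_keep] ** 2 * explained[:n_keep, None]).sum(axis=0)
top_modes = np.sort(np.argsort(importance)[-2:])

print("variance ratio of first two PCs:", ratio[:2].round(3))
print("modes dominating the leading PCs:", top_modes)
```

No information about which modes were driven is passed to the analysis; they are recovered from the loadings, which is the sense in which the abstract describes PCA as working without pre-defined physical knowledge.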
