Metrics Based Feature Selection for Software Defect Prediction

2020 ◽  
Vol 8 (2) ◽  
Author(s):  
Radityo Adi Nugroho ◽  
Friska Abadi ◽  
M. Reza Faisal ◽  
Rudy Herteno ◽  
Rahmat Ramadhani

Software now strongly influences many sectors of life, serving both business and personal needs. Producing high-quality software requires testing to avoid software defects. Many researchers are currently studying software defects using machine learning. This approach includes one important step, feature selection. In this study, the researchers performed feature selection based on software metric categories to determine the accuracy of software defect prediction, using 13 datasets from NASA MDP: CM1, JM1, KC1, KC3, KC4, MC1, MC2, MW1, PC1, PC2, PC3, PC4, and PC5. For classification, they used five classifiers: Naive Bayes, Decision Trees, Random Forests, K-Nearest Neighbor, and Support Vector Machines. The results show that each attribute in the software metric categories has an effect on each dataset. The Naive Bayes and Random Forest algorithms perform better than the other algorithms in classifying software defects with metrics-based feature selection. On the other hand, the best metrics category for each classifier is the Misc category. From the average AUC values, it can be concluded that the metrics category giving the best performance is LoC, followed by Misc. Both categories achieved their highest AUC values with the Random Forest classifier.
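The comparison described above rests on computing an AUC per classifier and averaging it per metric category. The following is a minimal sketch of that bookkeeping, not the authors' code: the function names are hypothetical, the AUC is computed by pairwise comparison (the Mann-Whitney formulation), and the numbers in the usage lines are toy values, not results from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    scores: predicted defect scores; labels: 1 = defective, 0 = clean.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


def mean_auc_by_category(results):
    """Average AUC per metric category; results = {category: {dataset: auc}}."""
    return {cat: sum(per_ds.values()) / len(per_ds)
            for cat, per_ds in results.items()}


# Toy values for illustration only (NOT taken from the paper).
toy = {"LoC": {"CM1": 1.0, "KC1": 0.5}, "Misc": {"CM1": 0.5, "KC1": 0.5}}
averages = mean_auc_by_category(toy)
best_category = max(averages, key=averages.get)
```

With real experiments, `results` would hold one AUC per dataset and category for a given classifier, and the same averaging would be repeated for each of the five classifiers.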

2020 ◽  
Vol 4 (3) ◽  
pp. 504-512
Author(s):  
Faried Zamachsari ◽  
Gabriel Vangeran Saragih ◽  
Susafa'ati ◽  
Windu Gata

The decision to move Indonesia's capital city to East Kalimantan received mixed responses on social media. The still-high poverty rate and the country's difficult finances are factors in disapproval of the relocation of the national capital. Twitter, as one of the popular social media platforms, is used by the public to express these opinions. This study aims to determine the tendency of public responses to the relocation of the national capital and to perform sentiment analysis of public opinion on it using the Naive Bayes and Support Vector Machine algorithms with feature selection, in order to obtain the highest accuracy. The sentiment analysis data are Indonesian-language tweets collected from Twitter by crawling. The search terms used are #IbuKotaBaru and #PindahIbuKota. The research stages consist of collecting data from Twitter, polarity labeling, and preprocessing, which comprises case transformation, cleansing, tokenizing, filtering, and stemming. Feature selection is applied to increase accuracy, and the data are then divided into training and testing sets according to predetermined ratios. The next step is a comparison between the Support Vector Machine and Naive Bayes methods to determine which is more accurate. In the data period studied, 24.26% of sentiment was positive and 75.74% negative regarding the move to a new capital city. Using RapidMiner software, the best accuracy for Naive Bayes with feature selection was 88.24% at a ratio of 9:1, while the best accuracy for Support Vector Machine with feature selection was 78.77% at a ratio of 5:5.
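The ratios quoted above (9:1 and 5:5) refer to how the labeled tweets are divided into training and testing data. A minimal sketch of such a split is shown here; the function name, the shuffling step, and the fixed seed are assumptions for illustration, since the abstract does not say how RapidMiner was configured.

```python
import random


def split_by_ratio(items, train_parts, test_parts, seed=0):
    """Shuffle items and split them into (train, test) by a ratio such as 9:1."""
    if train_parts <= 0 or test_parts <= 0:
        raise ValueError("ratio parts must be positive")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for 100 labeled tweets.
tweets = list(range(100))
train_91, test_91 = split_by_ratio(tweets, 9, 1)   # 90 train, 10 test
train_55, test_55 = split_by_ratio(tweets, 5, 5)   # 50 train, 50 test
```

Running each classifier on several such splits and recording the accuracy per ratio is one way to arrive at a "best ratio" per method, as the study reports.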


2021 ◽  
Vol 2 (2) ◽  
pp. 96-104
Author(s):  
REYNALDA NABILA CIKANIA

Halodoc is a telemedicine-based healthcare application that connects patients with health practitioners such as doctors, pharmacies, and laboratories. Halodoc users have left both positive and negative comments. This indicates public interest in the Halodoc application, so it is necessary to analyze the sentiment of the comments on the Halodoc application service, especially during the COVID-19 pandemic, in order to improve the service. The Naïve Bayes Classifier (NBC) and Support Vector Machine (SVM) algorithms are used to analyze the sentiment of users of Halodoc's telemedicine service application. Of 5,687 reviews, 12.33% were classified as negative and 87.67% as positive, which means that positive reviews outnumber negative ones. The Naive Bayes Classifier achieved an accuracy of 87.77% with an AUC of 57.11% and a G-Mean of 40.08%, while the SVM with an RBF kernel achieved an accuracy of 86.1% with an AUC of 60.149% and a G-Mean of 49.311%. Based on the accuracy value of the models, the authors conclude that the SVM with an RBF kernel is better than NBC at classifying the sentiment of user reviews of the Halodoc telemedicine service.
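The gap between high accuracy and low G-Mean reported above is typical of imbalanced data. As a hedged illustration, the sketch below computes accuracy and G-Mean (the geometric mean of sensitivity and specificity) from confusion-matrix counts. The function name and the counts are invented for this example and are not the study's confusion matrix.

```python
import math


def confusion_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity, and G-Mean from confusion counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    sensitivity = tp / (tp + fn)     # recall on the positive class
    specificity = tn / (tn + fp)     # recall on the negative class
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
    }


# Toy counts: 880 positive and 120 negative reviews, with a classifier
# that labels almost everything positive.
m = confusion_metrics(tp=870, fn=10, tn=12, fp=108)
# Accuracy is about 0.88, yet G-Mean is only about 0.31 because
# specificity on the minority (negative) class is 0.10.
```

This is why a model with slightly lower accuracy can still be the better choice when its AUC and G-Mean are higher.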


PLoS ONE ◽  
2014 ◽  
Vol 9 (1) ◽  
pp. e86703 ◽  
Author(s):  
Wangchao Lou ◽  
Xiaoqing Wang ◽  
Fan Chen ◽  
Yixiao Chen ◽  
Bo Jiang ◽  
...  

Author(s):  
Anirudh Reddy Cingireddy ◽  
Robin Ghosh ◽  
Supratik Kar ◽  
Venkata Melapu ◽  
Sravanthi Joginipeli ◽  
...  

Frequent testing of the entire population would help to identify individuals with active COVID-19 and to detect concealed carriers. Molecular tests, antigen tests, and antibody tests are widely used to confirm COVID-19 in the population. Molecular tests such as the real-time reverse transcription-polymerase chain reaction (rRT-PCR) test take from a minimum of 3 hours to a maximum of 4 days to return results. To overcome this issue, the authors suggest using machine learning and data mining tools to screen large populations at a preliminary level. The ML tools could reduce the size of the population to be tested by 20 to 30%. In this study, they used a subset of features from the full blood profile of patients at the Israelita Albert Einstein hospital in Brazil. They used the classification models KNN, logistic regression, XGBoost, naive Bayes, decision tree, random forest, support vector machine, and multilayer perceptron, validated with k-fold cross-validation. Naïve Bayes, KNN, and random forest stand out as the most predictive, with 88% accuracy each.
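For readers unfamiliar with k-fold cross-validation, the validation scheme mentioned above can be sketched as follows. This is a plain-Python illustration with contiguous, unshuffled folds and a majority-class baseline standing in for a real classifier; the function names and the toy labels are assumptions, and the abstract does not state the value of k or whether the folds were stratified.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(fit_predict, X, y, k):
    """Mean accuracy over k folds; fit_predict(train_X, train_y, test_X) -> predictions."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def majority_class(train_X, train_y, test_X):
    """Baseline 'model': always predict the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_X)


# Toy data: 8 positive and 2 negative samples.
X = list(range(10))
y = [1] * 8 + [0] * 2
baseline_accuracy = cross_validate(majority_class, X, y, k=5)
```

Any of the classifiers listed in the abstract could be plugged in through the `fit_predict` argument, which keeps the validation loop independent of the model.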


2016 ◽  
Vol 22 (4) ◽  
pp. 751-773 ◽  
Author(s):  
Carolina Gusmão Souza ◽  
Luis Carvalho ◽  
Polyanne Aguiar ◽  
Tássia Borges Arantes

Coffee growing is one of Brazil's main agricultural activities, and mapping and monitoring this crop is essential for understanding its spatial distribution. However, mapping these areas using remote sensing imagery is not an easy task. This work therefore aimed to compare the use of different variables and classification algorithms for mapping coffee-growing areas. The work was carried out in three different areas that are highly significant for coffee production. Five machine learning algorithms and seven combinations of variables (spectral, textural, and geometric) were used in the classification process. A total of 105 classifications were performed, 35 for each area. Classifications that did not use spectral variables did not yield good accuracy. In all three areas, the algorithm with the best accuracy was the Support Vector Machine, with an overall accuracy of 85.33% in Araguari, 87% in Carmo de Minas, and 88.33% in Três Pontas. The worst results were obtained with the Random Forest algorithm in Araguari, with an overall accuracy of 76.66%, and with Naive Bayes in Carmo de Minas and Três Pontas, with 76% and 82% accuracy, respectively. In all three areas, textural variables, when combined with spectral ones, improved classification accuracy. The SVM showed the best performance for the three areas.
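The experiment count in this abstract (5 algorithms, 7 variable combinations, 3 areas, 105 classifications) can be checked with a short enumeration. The sketch assumes that the seven combinations are the non-empty subsets of the three variable groups, which matches the count but is not stated in the abstract. Only three algorithms are named in the abstract, so the other two appear as placeholders.

```python
from itertools import combinations, product

variable_groups = ["spectral", "textural", "geometric"]

# Assumption: the 7 combinations are all non-empty subsets (2**3 - 1 = 7).
variable_sets = [subset
                 for size in range(1, len(variable_groups) + 1)
                 for subset in combinations(variable_groups, size)]

# Three algorithms are named in the abstract; the rest are placeholders.
algorithms = ["SVM", "Random Forest", "Naive Bayes", "algorithm_4", "algorithm_5"]
areas = ["Araguari", "Carmo de Minas", "Três Pontas"]

runs = list(product(areas, algorithms, variable_sets))
runs_per_area = len(runs) // len(areas)

# Subsets lacking spectral variables, which the abstract reports as weak.
without_spectral = [s for s in variable_sets if "spectral" not in s]
```

Under this assumption, three of the seven variable sets contain no spectral variables, which corresponds to the group of classifications the abstract reports as having poor accuracy.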


2021 ◽  
Vol 10 (3) ◽  
pp. 432-437
Author(s):  
Devi Irawan ◽  
Eza Budi Perkasa ◽  
Yurindra Yurindra ◽  
Delpiah Wahyuningsih ◽  
Ellya Helmud

Short message service (SMS) is one of the important communication media supporting users' fast-paced use of mobile phones. A hybrid SMS classification system is used to detect which SMS messages are spam and which are legitimate. This research involves collecting an SMS dataset, feature selection, preprocessing, vector creation, filtering, and system updating. The two types of SMS classification on mobile phones today are the blacklist (rejected) and the whitelist (accepted). This research uses several algorithms: support vector machine, Naïve Bayes classifier, Random Forest, and Bagging Classifier. The aim of this research is to address the currently widespread problem of SMS identified as spam, and thereby to inform the comparison of methods capable of filtering and separating spam and non-spam SMS. The study found that the Bagging classifier algorithm achieved the highest performance score of the algorithms tested, that it can be used to filter SMS arriving in the user's inbox, and that it gives accurate filtering results for incoming SMS.


2021 ◽  
Author(s):  
Ιωάννης Μήνου

The greatest challenge for modern computing systems is undoubtedly the efficient storage and retrieval of very large volumes of data. This need has emerged in recent years because of the data explosion observed on the internet, and it is becoming ever more important because of the very wide range of information that can be extracted. The field of healthcare and medical data is continuously and rapidly evolving. Exploiting Big Data in healthcare provides valuable information, as it offers unlimited possibilities for effective storage, processing, SQL queries, and analysis of medical data. The purpose of this dissertation is to study knowledge-mining techniques for large-volume data in the field of health. At the same time, the research aims to study statistical and computational algorithms for analyzing large volumes of health data that result in the production of new knowledge and the extraction of statistically significant information for health professionals. Finally, the dissertation investigates the knowledge of Health Informatics scientists and health professionals regarding Big Data. The dissertation includes a literature review of the concept of Big Data. The review covers the definition of Big Data, its characteristics, and its advantages and disadvantages in healthcare. It then discusses the implementation and storage mechanisms of Big Data. It also discusses systems for analyzing and processing large volumes of data, programming languages for Big Data, and data mining in healthcare. It further discusses the use of Big Data in Europe and worldwide. Finally, it presents the basic principles of the GDPR and how it relates to Big Data in healthcare.
Two empirical studies were also conducted. The first study aimed to record the views of Health Informatics scientists on Big Data technology. Data were collected using a questionnaire. The statistical analysis showed a positive response from the sample toward Big Data technology. The second study aimed to record the views of health professionals on Big Data technology. Data were collected using a questionnaire. The statistical analysis did not give sufficient answers, as respondents showed a positive attitude toward Big Data while answering that they do not know much about the technology. The last part of the dissertation covers the development of prediction methods for diagnosing patients with cardiovascular diseases. The prediction methods used are Logistic Regression, Naive Bayes Classifier, Decision Trees, the K-Nearest Neighbors algorithm, the SVM (Support Vector Machine) algorithm, and Random Forest. The development included all data preprocessing stages, and specific metrics were used to measure classifier performance. Finally, classifier performance was improved using cross-validation, and the class imbalance problem was addressed using the SMOTE method.
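SMOTE, mentioned at the end of this abstract, balances classes by creating synthetic minority samples on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that core idea in plain Python. It is a simplified stand-in rather than the dissertation's implementation or a library API, and the function name, parameters, and toy points are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()           # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class: four points on the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority_points, n_new=6)
```

Synthetic samples should be generated from the training folds only, so that no interpolated copy of a test sample leaks into training during cross-validation.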


2020 ◽  
Vol 8 (2) ◽  
pp. 91-100
Author(s):  
Muhamad Azhar ◽  
Noor Hafidz ◽  
Biktra Rudianto ◽  
Windu Gata

The implementation of technology in the marketplace world has attracted researchers to analyze customer reviews. The Klik Indomaret application page on Google Play is one source from which review data can be collected. However, extracting consumers' opinions from reviews is not an easy task and needs a specific method for categorizing the reviews into groups, i.e. positive or negative reviews. Sentiment analysis studies of application reviews on Google Play are still rare. Therefore, this paper analyzes customer sentiment toward the klikindomaret app using the Naive Bayes Classifier (NB) algorithm compared with the Support Vector Machine (SVM), and optimizes Feature Selection (FS) using the Particle Swarm Optimization method. Without FS optimization, NB achieved 69.74% accuracy and an Area Under Curve (AUC) of 0.518, and SVM achieved 81.21% accuracy and an AUC of 0.896. With FS, cross-validated NB achieved 75.21% accuracy and an AUC of 0.598, and cross-validated SVM achieved 81.84% accuracy and an AUC of 0.898. There is thus an increase when using Feature Selection (FS) with Particle Swarm Optimization, and SVM scores higher than NB on the dataset used in this study.   Keywords: Naive Bayes, Particle Swarm Optimization, Support Vector Machine, Feature Selection, Consumer Review.
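To make the idea of PSO-driven feature selection more concrete, here is a minimal sketch of a binary particle swarm that searches over 0/1 feature masks. It assumes the common binary PSO variant with a sigmoid transfer function, and the abstract does not say which variant or tool the authors used. The fitness function, parameter values, and names are hypothetical. In a real study the fitness would be a cross-validated classifier score.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=8, n_iters=30,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for a 0/1 feature mask that maximises fitness(mask)."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # personal best masks
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # global best mask
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, v))        # clamp velocity
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))  # sigmoid transfer
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


RELEVANT = {0, 2, 5}   # hypothetical indices of informative features


def toy_fitness(mask):
    """Stand-in for classifier accuracy: reward relevant features,
    penalise irrelevant ones."""
    return sum(1.0 if i in RELEVANT else -0.5
               for i, bit in enumerate(mask) if bit)


best_mask, best_fit = binary_pso(toy_fitness, n_features=8)
```

The returned mask marks which features to keep, and the classifier (NB or SVM in the study) would then be retrained on that reduced feature set.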

