COMBINATION OF MEDIAN WEIGHTED INFORMATION GAIN WITH K-NEAREST NEIGHBOR ON MONTHS-LABEL SOFTWARE EFFORT ESTIMATION DATASETS

2020 ◽  
Vol 14 (2) ◽  
pp. 138
Author(s):  
Indra Kurniawan
2020 ◽  
Vol 17 (1) ◽  
pp. 319-328
Author(s):  
Ade Muchlis Maulana Anwar ◽  
Prihastuti Harsani ◽  
Aries Maesya

Population data is individual or aggregate data structured as a result of Population Registration and Civil Registration activities. A birth certificate is a civil registration deed that records the birth of a baby; the birth is reported so that the child can be registered on the Family Card and given a Population Identification Number (NIK) as a basis for obtaining other community services. Of the 570,637 integrated birth certificate reports in the 2018 Population Administration Information System (SIAK), 503,946 were reported late and only 66,691 were reported publicly. Clustering is a method for grouping similar data into the same cluster and dissimilar data into different clusters. K-nearest neighbor is a method for classifying objects based on the training data closest in distance to the test data. K-means is a method for dividing a number of objects into groups by looking at the midpoint (centroid) of each group. In the data mining preprocessing stage, the data is cleaned by filling in blank values with the most frequent value, and attributes are selected using the information gain method. The k-nearest neighbor method was used to predict delays in reporting and the k-means method to group priority service areas, using 10,000 birth certificate records from 2019. Performance was good enough: the predictions reached an accuracy of 74.00%, and k-means with K = 2 produced a Davies-Bouldin index of 1.179.
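
As a rough illustration of the preprocessing described in this abstract, the sketch below fills missing values with the most frequent value of an attribute and then scores that attribute by information gain. The toy records, the attribute values, and the function names are assumptions made for illustration; they are not data or code from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def fill_missing_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def information_gain(values, labels):
    """Label entropy minus the weighted label entropy after splitting on the attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [lab for v, lab in zip(values, labels) if v == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: one attribute with a missing value and a late/on-time label.
region = ["urban", "rural", None, "rural", "urban", "rural"]
label = ["on_time", "late", "late", "late", "on_time", "late"]

region = fill_missing_with_mode(region)   # the None becomes "rural"
gain = information_gain(region, label)    # higher gain = more informative attribute
print(f"information gain of 'region': {gain:.4f}")
```

Attributes would then be ranked by this score and the lowest-ranked ones dropped before training the classifier; the cut-off used in the study is not stated in the abstract.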


Author(s):  
Kiran Marri ◽  
Ramakrishnan Swaminathan

Muscle fatigue is a neuromuscular condition experienced during daily activities. This phenomenon is generally characterized using surface electromyography (sEMG) signals and has gained a lot of interest in the fields of clinical rehabilitation, prosthetics control, and sports medicine. sEMG signals are complex, nonstationary and also exhibit self-similarity fractal characteristics. In this work, an attempt has been made to differentiate sEMG signals in nonfatigue and fatigue conditions during dynamic contraction using multifractal analysis. sEMG signals are recorded from biceps brachii muscles of 42 healthy adult volunteers while performing curl exercise. The signals are preprocessed and segmented into nonfatigue and fatigue conditions using the first and last curls, respectively. The multifractal detrended moving average algorithm (MFDMA) is applied to both segments, and multifractal singularity spectrum (SSM) function is derived. Five conventional features are extracted from the singularity spectrum. Twenty-five new features are proposed for analyzing muscle fatigue from the multifractal spectrum. These proposed features are adopted from analysis of sEMG signals and muscle fatigue studies performed in time and frequency domain. These proposed 25 feature sets are compared with conventional five features using feature selection methods such as Wilcoxon rank sum, information gain (IG) and genetic algorithm (GA) techniques. Two classification algorithms, namely, k-nearest neighbor (k-NN) and logistic regression (LR), are explored for differentiating muscle fatigue. The results show that about 60% of the proposed features are statistically highly significant and suitable for muscle fatigue analysis. The results also show that eight proposed features ranked among the top 10 features. The classification accuracy with conventional features in dynamic contraction is 75%. This accuracy improved to 88% with k-NN-GA combination with proposed new feature set. Based on the results, it appears that the multifractal spectrum analysis with new singularity features can be used for clinical evaluation in varied neuromuscular conditions, and the proposed features can also be useful in analyzing other physiological time series.


2014 ◽  
Vol 701-702 ◽  
pp. 110-113
Author(s):  
Qi Rui Zhang ◽  
He Xian Wang ◽  
Jiang Wei Qin

This paper reports a comparative study of feature selection algorithms on a hyperlipimedia data set. Three feature selection methods were evaluated: document frequency (DF), information gain (IG), and the χ² statistic (CHI). The classification systems use a vector to represent a document and use tfidfie (term frequency, inverted document frequency, and inverted entropy) to compute term weights. To compare the effectiveness of feature selection, we used three classification methods: Naïve Bayes (NB), k-Nearest Neighbor (kNN), and Support Vector Machines (SVM). The experimental results show that IG and CHI significantly outperform DF, and that SVM and NB are more effective than kNN when the macro-averaged F1 measure is used. DF is suitable for large-scale text classification tasks.
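
For readers unfamiliar with the χ² score as a term-selection criterion, here is a minimal sketch using the 2×2 term/category contingency form commonly used in text categorization. The abstract does not give the exact formula the authors used, so this form, the function name, and the counts are assumptions.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a category from a 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # A row or column of the table is empty, so there is no evidence of association.
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical counts for two terms in a 100-document collection.
independent_term = chi_square(10, 10, 40, 40)  # equally common inside and outside the category
indicative_term = chi_square(40, 5, 10, 45)    # concentrated inside the category
print(independent_term, round(indicative_term, 2))
```

A term whose occurrence is independent of the category scores zero, while a term concentrated in one category scores high; terms are then ranked by score and the top-ranked ones kept.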


2022 ◽  
Vol 8 (1) ◽  
pp. 50
Author(s):  
Rifki Indra Perwira ◽  
Bambang Yuwono ◽  
Risya Ines Putri Siswoyo ◽  
Febri Liantoni ◽  
Hidayatulah Himawan

State universities have libraries as facilities that support students’ education and research, holding various books, journals, and final assignments. An intelligent document classification system is needed to make things easier for library visitors in higher education, as a service to students. The documents in the library are generally the results of research. The main reasons a classification system is needed are complaints about imbalanced text data, categories based on irrelevant document titles, and words with ambiguous meanings when searching for documents. This research uses k-Nearest Neighbor (k-NN) to categorize documents by study interest, with information gain feature selection to handle unbalanced data and cosine similarity to measure the distance between test and training data. In tests conducted with 276 training data, the best result with information gain feature selection was an accuracy of 87.5%, obtained with 80% training data, 20% test data, and k=5. The highest accuracy, 92.9%, was achieved without information gain feature selection, using 90% training data, 10% test data, and k=5, 7, and 9. This paper concludes that the system is more accurate without information gain feature selection than with it, because every word in a document title is considered to play an essential role in the classification.
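
The combination of k-NN and cosine similarity described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: titles are represented as raw word-count vectors, and the training titles, labels, and function names are hypothetical rather than taken from the study.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counter objects)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(title, training, k=3):
    """Label a title by majority vote among its k most similar training titles."""
    query = Counter(title.lower().split())
    scored = sorted(
        ((cosine_similarity(query, Counter(text.lower().split())), label)
         for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training titles with study-interest labels.
training = [
    ("image classification with convolutional networks", "vision"),
    ("object detection in satellite image data", "vision"),
    ("sentiment analysis of product reviews", "text"),
    ("topic modeling of news text", "text"),
    ("text classification with naive bayes", "text"),
]
predicted = knn_predict("satellite image classification", training, k=3)
print(predicted)
```

Because every word in a short title contributes to the similarity score, removing words through feature selection can discard useful signal, which is consistent with the abstract's conclusion.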


2018 ◽  
Vol 150 ◽  
pp. 06006 ◽  
Author(s):  
Rozlini Mohamed ◽  
Munirah Mohd Yusof ◽  
Noorhaniza Wahidi

Feature selection is the process of selecting the best features from the large number of features in a dataset. The problem in feature selection is to select a subset that gives better performance under a given classifier. To produce better classification results, feature selection has been applied in many classification works as part of the preprocessing step, where only a subset of features is used rather than all features of a particular dataset. This procedure not only reduces irrelevant features but in some cases can also increase classification performance, given the finite sample size. In this study, Chi-Square (CH), Information Gain (IG), and Bat Algorithm (BA) are used to obtain feature subsets on fourteen well-known datasets from various applications. To measure the performance of the selected features, three benchmark classifiers are used: k-Nearest Neighbor (kNN), Naïve Bayes (NB), and Decision Tree (DT). The paper then analyzes the performance of all classifiers with feature selection in terms of accuracy, sensitivity, F-measure, and ROC. The objective of this study is to identify which feature selection techniques perform best among conventional and heuristic techniques in various applications.
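
Three of the evaluation measures named in this abstract can be computed directly from binary confusion-matrix counts, as in the sketch below. The counts and the function name are hypothetical, and ROC analysis is left out because it needs ranked classifier scores rather than counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity (recall), precision, and F-measure from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + sensitivity:
        f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts for one classifier on one dataset.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```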


Author(s):  
Abdulfatai Ganiyu Oladepo ◽  
Amos Orenyi Bajeh ◽  
Abdullateef Oluwagbemiga Balogun ◽  
Hammed Adeleye Mojeed ◽  
Abdulsalam Abiodun Salman ◽  
...  

This study presents a novel framework based on a heterogeneous ensemble method and a hybrid dimensionality reduction technique for spam detection in micro-blogging social networks. A hybrid of Information Gain (IG) and Principal Component Analysis (PCA) was implemented as the dimensionality reduction step for selecting important features, and a heterogeneous ensemble consisting of Naïve Bayes (NB), K Nearest Neighbor (KNN), Logistic Regression (LR), and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) classifiers, combined by Average of Probabilities (AOP), was used for spam detection. The proposed framework was applied to the MPI_SWS and SAC’13 Tip spam datasets, and the developed models were evaluated on accuracy, precision, recall, F-measure, and area under the curve (AUC). In the experimental results, the proposed framework (that is, Ensemble + IG + PCA) outperformed the other methods tested on the studied spam datasets. Specifically, the proposed method had an average accuracy of 87.5%, an average precision of 0.877, an average recall of 0.845, an average F-measure of 0.872, and an average AUC of 0.943. The proposed method also performed better than some existing methods. Consequently, this study has shown that addressing high dimensionality in spam datasets, in this case with a hybrid of IG and PCA combined with a heterogeneous ensemble method, can produce a more effective method for detecting spam content.
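
The Average of Probabilities combination rule mentioned in this abstract can be sketched in a few lines: each base classifier outputs a class-probability vector, the vectors are averaged, and the class with the highest mean probability wins. The probability values and the function name below are hypothetical; the base classifiers themselves are not implemented here.

```python
def average_of_probabilities(per_model_probs):
    """Average class-probability dicts from several classifiers and pick the top class."""
    n_models = len(per_model_probs)
    classes = per_model_probs[0].keys()
    averaged = {
        cls: sum(probs[cls] for probs in per_model_probs) / n_models
        for cls in classes
    }
    return max(averaged, key=averaged.get), averaged


# Hypothetical outputs of four base classifiers for a single message.
probs = [
    {"spam": 0.90, "ham": 0.10},  # e.g. Naive Bayes
    {"spam": 0.40, "ham": 0.60},  # e.g. k-nearest neighbor
    {"spam": 0.70, "ham": 0.30},  # e.g. logistic regression
    {"spam": 0.45, "ham": 0.55},  # e.g. RIPPER
]
label, averaged = average_of_probabilities(probs)
print(label, averaged)
```

In this example a simple majority vote would be tied at two votes each, while averaging lets the more confident classifiers decide the outcome.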


Author(s):  
Andrijana Kos Kavran ◽  
Bruno Trstenjak

The purpose of this chapter is to investigate the impact of augmented reality experiential marketing (AREM) on tourist experience satisfaction. The chapter adds to the existing body of literature in the area of tourist experience satisfaction and its attributes and the use of augmented reality in the scope of experiential marketing. An experiment using an augmented reality system was conducted, which included a sample of 432 tourists who visited a tourist destination in Croatia. The data were tested using machine learning methods, namely information gain (IG) technique, K-means method, weighted K nearest neighbor (WKNN) method, and linear regression (LR) method. Findings indicate that augmented reality experiential marketing has a positive impact on tourist experience satisfaction.


2021 ◽  
Vol 5 (1) ◽  
pp. 26-48
Author(s):  
Olasehinde Olayemi Oladimeji ◽  
Alese Boniface Kayode ◽  
Adetunmbi Adebayo Olusola ◽  
Aladesote Olomi Isaiah

The significant rise in the frequency, sophistication, and diversity of cyber-attacks has led various researchers to develop strong and effective approaches to recurring cyber threat challenges. This study evaluated the performance of three selected meta-learning models for optimal multi-class detection of cyber-attacks using the University of New South Wales 2015 Network Benchmark (UNSW-NB15) intrusion dataset. The results confirm the ability of the three base models, Naive Bayes, C4.5 Decision Tree, and K-Nearest Neighbor, to solve multi-class problems. They further affirm the ability of feature selection techniques combined with stacked ensemble learning to optimize the performance of ML models. Stacking the predictions of the information gain base models with the Model Decision Tree meta-algorithm yielded a greater improvement in cyber-attack detection accuracy and Matthews correlation coefficient than stacking with the Multiple Model Trees (MMT) and Multi Response Linear Regression (MLR) meta-algorithms.
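
Since this abstract reports the Matthews correlation coefficient for a multi-class problem, the sketch below computes the standard multi-class generalization of that coefficient from true and predicted labels. The abstract does not say which variant the authors computed, so treat this as an assumption; the attack labels and the function name are hypothetical.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient from two label sequences."""
    samples = len(y_true)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_counts = Counter(y_true)   # how often each class truly occurs
    pred_counts = Counter(y_pred)   # how often each class is predicted
    classes = set(true_counts) | set(pred_counts)

    numerator = correct * samples - sum(
        true_counts[k] * pred_counts[k] for k in classes)
    pred_term = samples ** 2 - sum(n * n for n in pred_counts.values())
    true_term = samples ** 2 - sum(n * n for n in true_counts.values())
    if pred_term == 0 or true_term == 0:
        # Only one class present in the truth or the predictions: defined as 0 here.
        return 0.0
    return numerator / math.sqrt(pred_term * true_term)


# Hypothetical labels for six network records.
y_true = ["normal", "normal", "dos", "dos", "exploit", "exploit"]
y_pred = ["normal", "normal", "dos", "exploit", "exploit", "exploit"]
score = multiclass_mcc(y_true, y_pred)
print(f"MCC: {score:.4f}")
```

With two classes this expression reduces to the familiar binary form, so the same function covers both cases.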

