Modified balanced random forest for improving imbalanced data prediction

Author(s):  
Zahra Putri Agusta ◽  
Adiwijaya Adiwijaya

This paper proposes a Modified Balanced Random Forest (MBRF) algorithm as a classification technique for imbalanced data. MBRF modifies the Balanced Random Forest by applying a clustering-based under-sampling strategy to the bootstrap sample of each decision tree in the Random Forest algorithm. To find the optimal configuration, the proposed method was evaluated with four clustering techniques: K-Means, Spectral Clustering, Agglomerative Clustering, and Ward Hierarchical Clustering. The experimental results show that Ward Hierarchical Clustering achieved the best performance, and that the proposed MBRF method outperformed both the Balanced Random Forest (BRF) and Random Forest (RF) algorithms, with a sensitivity or true positive rate (TPR) of 93.42%, a specificity or true negative rate (TNR) of 93.60%, and the best AUC value of 93.51%. Moreover, MBRF also reduced running time.
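
As a rough illustration of the approach, the sketch below balances each tree's bootstrap by clustering the majority class with Ward linkage and keeping one centroid per cluster before fitting the tree. The 0/1 labels, the scikit-learn clustering call, and the centroid representatives are assumptions; the paper's exact sampling details may differ.

```python
# A minimal sketch of the MBRF idea, not the authors' implementation:
# each tree gets a bootstrap whose majority class is under-sampled to the
# minority size via Ward hierarchical clustering, with one synthetic
# centroid kept per cluster. Labels are assumed to be 0/1.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.tree import DecisionTreeClassifier

def balanced_bootstrap_tree(X, y, rng):
    idx = rng.integers(0, len(X), len(X))                 # bootstrap sample
    Xb, yb = X[idx], y[idx]
    maj, mino = (0, 1) if (yb == 0).sum() > (yb == 1).sum() else (1, 0)
    X_min, X_maj = Xb[yb == mino], Xb[yb == maj]
    k = len(X_min)                                        # one cluster per minority sample
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X_maj)
    reps = np.array([X_maj[labels == c].mean(axis=0) for c in range(k)])  # cluster centroids
    X_bal = np.vstack([reps, X_min])
    y_bal = np.array([maj] * k + [mino] * k)
    return DecisionTreeClassifier().fit(X_bal, y_bal)

# Usage: fit an ensemble and take a majority vote.
# rng = np.random.default_rng(42)
# forest = [balanced_bootstrap_tree(X, y, rng) for _ in range(100)]
# y_pred = (np.mean([t.predict(X_test) for t in forest], axis=0) >= 0.5).astype(int)
```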

2018 ◽  
Vol 7 (1) ◽  
pp. 49-56
Author(s):  
Firdaus Firdaus

This paper presents a method to improve the data integrity of an individual-based bibliographic repository. Integrity is improved by comparing individual-based raw publication data with individual-based clustered publication data. Hierarchical agglomerative clustering is used to cluster publication records with similar author names. Clustering is performed in two steps: the first clusters by co-author relationship, and the second by title similarity and year difference. The two-step hierarchical clustering technique for name disambiguation has been applied to the Universitas Sriwijaya Publication Data Center with good accuracy.
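
A hedged sketch of how such two-step clustering could look in code is given below: records are first grouped by co-author overlap, then each group can be re-clustered by title similarity and year difference. The record fields (coauthors, title, year), the distance definitions, and the thresholds are all illustrative assumptions, not the paper's settings.

```python
# Illustrative two-step agglomerative clustering for name disambiguation;
# field names, distances, and thresholds are assumptions.
import numpy as np
from difflib import SequenceMatcher
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_records(records, dist, threshold):
    """Average-linkage agglomerative clustering over a pairwise distance."""
    n = len(records)
    if n == 1:
        return np.array([1])
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(records[i], records[j])
    return fcluster(linkage(squareform(D), method="average"),
                    t=threshold, criterion="distance")

def coauthor_dist(a, b):          # step 1: Jaccard distance on co-author sets
    sa, sb = set(a["coauthors"]), set(b["coauthors"])
    return 1.0 - len(sa & sb) / max(len(sa | sb), 1)

def title_year_dist(a, b):        # step 2: title dissimilarity plus year gap
    sim = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return (1.0 - sim) + 0.05 * abs(a["year"] - b["year"])
```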


Mathematics ◽  
2021 ◽  
Vol 9 (21) ◽  
pp. 2683
Author(s):  
Tzu-Hsuan Lin ◽  
Jehn-Ruey Jiang

This paper proposes a method, called autoencoder with probabilistic random forest (AE-PRF), for detecting credit card frauds. The proposed AE-PRF method first utilizes the autoencoder to extract features of low-dimensionality from credit card transaction data features of high-dimensionality. It then relies on the random forest, an ensemble learning mechanism using the bootstrap aggregating (bagging) concept, with probabilistic classification to classify data as fraudulent or normal. The credit card fraud detection (CCFD) dataset is applied to AE-PRF for performance evaluation and comparison. The CCFD dataset contains large numbers of credit card transactions of European cardholders; it is highly imbalanced since its normal transactions far outnumber fraudulent transactions. Data resampling schemes like the synthetic minority oversampling technique (SMOTE), adaptive synthetic (ADASYN), and Tomek link (T-Link) are applied to the CCFD dataset to balance the numbers of normal and fraudulent transactions for improving AE-PRF performance. Experimental results show that the performance of AE-PRF does not vary much whether or not resampling schemes are applied to the dataset. This indicates that AE-PRF is naturally suitable for dealing with imbalanced datasets. When compared with related methods, AE-PRF has relatively excellent performance in terms of accuracy, the true positive rate, the true negative rate, the Matthews correlation coefficient, and the area under the receiver operating characteristic curve.
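
The sketch below illustrates the AE-PRF pipeline shape: an autoencoder compresses the transaction features, and a random forest classifies in the latent space using class probabilities. The layer sizes, training settings, and the 0.5 decision threshold are assumptions, not the paper's tuned configuration.

```python
# A minimal sketch of an autoencoder + probabilistic random forest pipeline.
import numpy as np
from tensorflow import keras
from sklearn.ensemble import RandomForestClassifier

def build_autoencoder(n_features, latent_dim=8):
    inp = keras.Input(shape=(n_features,))
    z = keras.layers.Dense(latent_dim, activation="relu")(inp)    # encoder
    out = keras.layers.Dense(n_features, activation="linear")(z)  # decoder
    ae = keras.Model(inp, out)
    ae.compile(optimizer="adam", loss="mse")
    return ae, keras.Model(inp, z)

def fit_ae_prf(X_train, y_train):
    ae, encoder = build_autoencoder(X_train.shape[1])
    ae.fit(X_train, X_train, epochs=20, batch_size=256, verbose=0)  # reconstruct inputs
    Z = encoder.predict(X_train, verbose=0)                         # low-dimensional features
    rf = RandomForestClassifier(n_estimators=100).fit(Z, y_train)
    return encoder, rf

def predict_fraud(encoder, rf, X, threshold=0.5):
    p = rf.predict_proba(encoder.predict(X, verbose=0))[:, 1]  # fraud probability
    return (p >= threshold).astype(int)
```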


2018 ◽  
Vol 7 (4.15) ◽  
pp. 39 ◽  
Author(s):  
Utomo Pujianto

Data imbalance is one of the characteristics of software quality data sets that can have a negative effect on the performance of software defect prediction models. This study proposed an alternative to the random under-sampling strategy by using only the subset of non-defective data calculated as having the largest distance to the centroid of the defective data. Combined with random forest classification, the proposed method outperformed both the random under-sampling and non-sampling methods in terms of accuracy, AUC, F-measure, and true positive rate.
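
A minimal sketch of this centroid-distance under-sampling is shown below: keep only the non-defective samples farthest from the centroid of the defective class, then train a random forest. The 1:1 class ratio and the Euclidean distance are assumptions.

```python
# Sketch of centroid-distance under-sampling; labels assumed 0/1 with
# defect_label marking the defective (minority) class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def centroid_distance_undersample(X, y, defect_label=1):
    X_def = X[y == defect_label]
    X_clean = X[y != defect_label]
    centroid = X_def.mean(axis=0)                    # centroid of defective data
    d = np.linalg.norm(X_clean - centroid, axis=1)   # Euclidean distance per sample
    keep = np.argsort(d)[-len(X_def):]               # farthest non-defective samples
    X_bal = np.vstack([X_def, X_clean[keep]])
    y_bal = np.array([defect_label] * len(X_def) + [1 - defect_label] * len(keep))
    return X_bal, y_bal

# X_bal, y_bal = centroid_distance_undersample(X, y)
# model = RandomForestClassifier(n_estimators=100).fit(X_bal, y_bal)
```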


Author(s):  
Alifia Puspaningrum ◽  
Nahya Nur ◽  
Ozzy Secio Riza ◽  
Agus Zainal Arifin

Automatic classification of tuna images needs good segmentation as a main process. Tuna images are taken against a textured background with the tuna's shadow behind the object. This paper proposes a new weighted thresholding method for tuna image segmentation that adapts hierarchical cluster analysis (HCA) and the percentile method. The proposed method considers both the whole image and selected parts of it, which are used to estimate the object whose proportion is known. To detect the edges of tuna images, a 2-D Gabor filter is applied to the image. The resulting image is then thresholded with a value calculated using HCA and the percentile method, and mathematical morphology operations are applied to the thresholded image. In the experiments, the proposed method improved accuracy by up to 20.04%, sensitivity by up to 29.94%, and specificity by up to 17.23% compared to HCA. The results show that the proposed method can segment tuna images well and more accurately than the hierarchical cluster analysis method alone.
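
The following is a speculative reconstruction of such a pipeline, under stated assumptions: the Gabor frequency, the weight w, the percentile q, and the structuring element are placeholders, not the paper's tuned values.

```python
# Hedged sketch: Gabor edge response, threshold as a weighted mix of a
# cluster-based estimate and a percentile estimate, then morphological
# clean-up. All parameters are illustrative assumptions.
import numpy as np
from skimage.filters import gabor
from skimage.morphology import binary_closing, binary_opening, disk
from sklearn.cluster import AgglomerativeClustering

def segment_tuna(image, w=0.5, q=75):
    real, _ = gabor(image, frequency=0.6)              # 2-D Gabor edge response
    vals = real.reshape(-1, 1)
    n = min(2000, len(vals))                           # subsample pixels for clustering
    sample = vals[np.random.default_rng(0).choice(len(vals), n, replace=False)]
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(sample)
    t_hca = (sample[labels == 0].mean() + sample[labels == 1].mean()) / 2
    t_pct = np.percentile(vals, q)                     # percentile threshold
    t = w * t_hca + (1 - w) * t_pct                    # weighted combination
    mask = real > t
    return binary_closing(binary_opening(mask, disk(3)), disk(3))  # clean-up
```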


Energies ◽  
2021 ◽  
Vol 14 (4) ◽  
pp. 1028
Author(s):  
Silvia Corigliano ◽  
Federico Rosato ◽  
Carla Ortiz Dominguez ◽  
Marco Merlo

The scientific community is active in developing new models and methods to help reach the ambitious target set by UN SDG 7: universal access to electricity by 2030. Efficient planning of distribution networks is a complex and multivariate task, which is usually split into multiple subproblems to reduce the number of variables. The present work addresses the problem of optimal secondary substation siting by means of different clustering techniques. In contrast with the majority of approaches found in the literature, which are devoted to the planning of MV grids in already electrified urban areas, this work focuses on greenfield planning in rural areas. The K-means algorithm, hierarchical agglomerative clustering, and a method based on optimal weighted tree partitioning are adapted to the problem and run on two real case studies with different population densities. The algorithms are compared in terms of different indicators useful to assess the feasibility of the solutions found. The algorithms have proven effective in addressing some of the crucial aspects of substation siting and constitute relevant improvements over the classic K-means approach found in the literature. However, it is found to be very challenging to reconcile an acceptable geographical span of the area served by a single substation with a substation power high enough to justify the installation when the load density is very low. In other words, well-known standards adopted in industrialized countries do not fit developing countries' requirements.
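
For orientation, the sketch below shows the classic K-means baseline that the paper improves upon: clustering household coordinates with per-point electrical load as the sample weight, so candidate substations gravitate toward load centres. The load column, cluster count, and served-power check are assumptions.

```python
# Illustrative load-weighted K-means baseline for substation siting.
import numpy as np
from sklearn.cluster import KMeans

def site_substations(coords, load_kw, n_substations):
    """coords: (n, 2) projected x/y positions; load_kw: demand per point."""
    km = KMeans(n_clusters=n_substations, n_init=10, random_state=0)
    km.fit(coords, sample_weight=load_kw)            # load-weighted centroids
    sites = km.cluster_centers_                      # candidate substation sites
    served_kw = np.bincount(km.labels_, weights=load_kw, minlength=n_substations)
    return sites, served_kw   # compare served_kw against a minimum viable power
```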


2020 ◽  
pp. 1-12
Author(s):  
Ayla Gülcü ◽  
Sedrettin Çalişkan

The collateral mechanism in the Electricity Market ensures that payments are executed in a timely manner, thus maintaining continuous cash flow. To value collaterals, Takasbank, the authorized central settlement bank, segments the market participants by considering their short-term and long-term debt/credit information arising from all market activities. In this study, data on participants' daily and monthly debt payment and penalty behavior is analyzed with the aim of discovering high-risk participants that frequently fail to clear their debts on time. Different clustering techniques along with different distance metrics are considered to obtain the best clustering. Moreover, data preprocessing techniques along with Recency, Frequency, Monetary Value (RFM) scoring have been used to determine the best representation of the data. The results show that agglomerative clustering with cosine distance achieves the best-separated clustering when the non-normalized dataset is used; this is also acknowledged by a domain expert.
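
A hedged sketch of RFM scoring followed by cosine-distance agglomerative clustering is shown below; the column names, quintile scoring, and cluster count are assumptions about the (confidential) Takasbank data, not the study's exact setup.

```python
# Illustrative RFM scoring + agglomerative clustering with cosine distance.
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

def rfm_cluster(df, n_clusters=4):
    """df columns assumed: participant, pay_date (datetime), debt_amount."""
    now = df["pay_date"].max()
    rfm = df.groupby("participant").agg(
        recency=("pay_date", lambda s: (now - s.max()).days),
        frequency=("pay_date", "count"),
        monetary=("debt_amount", "sum"),
    )
    scores = pd.DataFrame({                      # quintile scores per dimension
        "R": pd.qcut(rfm["recency"], 5, labels=False, duplicates="drop"),
        "F": pd.qcut(rfm["frequency"], 5, labels=False, duplicates="drop"),
        "M": pd.qcut(rfm["monetary"], 5, labels=False, duplicates="drop"),
    }, index=rfm.index)
    scores["R"] = scores["R"].max() - scores["R"]   # recent activity scores high
    X = scores.values + 1.0                         # avoid zero vectors under cosine
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="cosine",   # "affinity" in older scikit-learn
                                    linkage="average")
    rfm["cluster"] = model.fit_predict(X)
    return rfm
```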


Author(s):  
Antonio Quintero Rincón ◽  
Hadj Batatia ◽  
Jorge Prende ◽  
Valeria Muro ◽  
Carlos D'Giano

Spike-and-wave discharge (SWD) pattern detection in electroencephalography (EEG) signals is a key signal processing problem. It is particularly important for overcoming the time-consuming, difficult, and error-prone manual analysis of long-term EEG recordings. This paper presents a new SWD detection method with low computational complexity that can be easily trained with data from standard medical protocols. Specifically, EEG signals are divided into time segments to which the Morlet 1-D decomposition is applied. The generalized Gaussian distribution (GGD) statistical model is fitted to the resulting wavelet coefficients. A k-nearest neighbors (k-NN) self-supervised classifier is trained using the GGD parameters to detect the spike-and-wave pattern. Experiments were conducted using 106 spike-and-wave signals and 106 non-spike-and-wave signals for training, and another 96 annotated EEG segments from six human subjects for testing. The proposed SWD classification methodology achieved 95% sensitivity (true positive rate), 87% specificity (true negative rate), and 92% accuracy. These results pave the way for new research into the causes underlying so-called absence epilepsy in long-term EEG recordings.
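
A simplified sketch of the feature pipeline follows: a Morlet wavelet decomposition of each EEG segment, a generalized Gaussian fit (scipy's gennorm) over the coefficients, and a plain supervised k-NN over the fitted parameters. The wavelet scales and k are assumptions, and the paper's self-supervised training detail is not reproduced here.

```python
# Hedged sketch: Morlet wavelet -> GGD parameters -> k-NN classifier.
import numpy as np
import pywt
from scipy.stats import gennorm
from sklearn.neighbors import KNeighborsClassifier

def ggd_features(segment, scales=np.arange(1, 32)):
    coefs, _ = pywt.cwt(segment, scales, "morl")     # Morlet 1-D decomposition
    beta, loc, scale = gennorm.fit(coefs.ravel())    # GGD shape/location/scale
    return [beta, loc, scale]

def train_swd_classifier(segments, labels, k=3):
    X = np.array([ggd_features(s) for s in segments])
    return KNeighborsClassifier(n_neighbors=k).fit(X, labels)
```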


2021 ◽  
Vol 10 (1) ◽  
pp. 57
Author(s):  
Ms. K. Sudharani ◽  
Dr. N. K. Sakthivel

The Certificateless Public Key Cryptography (CL-PKC) scheme is a new standard that combines Identity (ID)-based cryptography and traditional PKC. It yields better security than the ID-based cryptography scheme without requiring digital certificates. In the CL-PKC scheme, as the Key Generation Center (KGC) generates a public key using a partial secret key, the need for authenticating the public key by a trusted third party is avoided. Due to this lack of authentication, the public key associated with the private key of a user may be replaced by anyone; therefore, the ciphertext cannot be decrypted accurately. To mitigate this issue, an Enhanced Certificateless Proxy Signature (E-CLPS) is proposed that offers a high security guarantee and requires minimal computational cost. In this work, the Hackman tool is used for detecting dictionary attacks in the cloud. From the experimental analysis, it is observed that the proposed E-CLPS scheme yields a better attack detection rate, true positive rate, and true negative rate, and fewer false positives and false negatives than the existing schemes.

