Modified balanced random forest for improving imbalanced data prediction

Author(s):  
Zahra Putri Agusta ◽  
Adiwijaya Adiwijaya

This paper proposes a Modified Balanced Random Forest (MBRF) algorithm as a classification technique for imbalanced data. MBRF modifies the Balanced Random Forest by applying a clustering-based under-sampling strategy to the bootstrap sample of each decision tree in the Random Forest algorithm. To find the optimal configuration, the proposed method was evaluated with four clustering techniques: K-Means, Spectral Clustering, Agglomerative Clustering, and Ward Hierarchical Clustering. The experimental results show that Ward Hierarchical Clustering achieved the best performance, and that the proposed MBRF method outperformed both the Balanced Random Forest (BRF) and Random Forest (RF) algorithms, with a sensitivity or true positive rate (TPR) of 93.42%, a specificity or true negative rate (TNR) of 93.60%, and the best AUC value of 93.51%. Moreover, MBRF also reduced running time.
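
As a rough illustration of the approach, the sketch below balances each tree's bootstrap by clustering the majority class with Ward linkage and keeping one centroid per cluster before fitting the tree. The 0/1 labels, the scikit-learn clustering call, and the centroid representatives are assumptions; the paper's exact sampling details may differ.

```python
# A minimal sketch of the MBRF idea, not the authors' implementation:
# each tree gets a bootstrap whose majority class is under-sampled to the
# minority size via Ward hierarchical clustering, with one synthetic
# centroid kept per cluster. Labels are assumed to be 0/1.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.tree import DecisionTreeClassifier

def balanced_bootstrap_tree(X, y, rng):
    idx = rng.integers(0, len(X), len(X))                 # bootstrap sample
    Xb, yb = X[idx], y[idx]
    maj, mino = (0, 1) if (yb == 0).sum() > (yb == 1).sum() else (1, 0)
    X_min, X_maj = Xb[yb == mino], Xb[yb == maj]
    k = len(X_min)                                        # one cluster per minority sample
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X_maj)
    reps = np.array([X_maj[labels == c].mean(axis=0) for c in range(k)])  # cluster centroids
    X_bal = np.vstack([reps, X_min])
    y_bal = np.array([maj] * k + [mino] * k)
    return DecisionTreeClassifier().fit(X_bal, y_bal)

# Usage: fit an ensemble and take a majority vote.
# rng = np.random.default_rng(42)
# forest = [balanced_bootstrap_tree(X, y, rng) for _ in range(100)]
# y_pred = (np.mean([t.predict(X_test) for t in forest], axis=0) >= 0.5).astype(int)
```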

2018 ◽  
Vol 7 (1) ◽  
pp. 49-56
Author(s):  
Firdaus Firdaus

This paper presents a method to improve the data integrity of an individual-based bibliographic repository. Integrity is improved by comparing individual-based raw publication data with individual-based clustered publication data. Hierarchical agglomerative clustering is used to cluster publication records with similar author names. Clustering is performed in two steps: the first clusters by co-author relationship, and the second by title similarity and year difference. The two-step hierarchical clustering technique for name disambiguation has been applied to the Universitas Sriwijaya Publication Data Center with good accuracy.
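
A hedged sketch of how such two-step clustering could look in code is given below: records are first grouped by co-author overlap, then each group can be re-clustered by title similarity and year difference. The record fields (coauthors, title, year), the distance definitions, and the thresholds are all illustrative assumptions, not the paper's settings.

```python
# Illustrative two-step agglomerative clustering for name disambiguation;
# field names, distances, and thresholds are assumptions.
import numpy as np
from difflib import SequenceMatcher
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_records(records, dist, threshold):
    """Average-linkage agglomerative clustering over a pairwise distance."""
    n = len(records)
    if n == 1:
        return np.array([1])
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(records[i], records[j])
    return fcluster(linkage(squareform(D), method="average"),
                    t=threshold, criterion="distance")

def coauthor_dist(a, b):          # step 1: Jaccard distance on co-author sets
    sa, sb = set(a["coauthors"]), set(b["coauthors"])
    return 1.0 - len(sa & sb) / max(len(sa | sb), 1)

def title_year_dist(a, b):        # step 2: title dissimilarity plus year gap
    sim = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    return (1.0 - sim) + 0.05 * abs(a["year"] - b["year"])
```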


Mathematics ◽  
2021 ◽  
Vol 9 (21) ◽  
pp. 2683
Author(s):  
Tzu-Hsuan Lin ◽  
Jehn-Ruey Jiang

This paper proposes a method, called autoencoder with probabilistic random forest (AE-PRF), for detecting credit card frauds. The proposed AE-PRF method first utilizes the autoencoder to extract features of low-dimensionality from credit card transaction data features of high-dimensionality. It then relies on the random forest, an ensemble learning mechanism using the bootstrap aggregating (bagging) concept, with probabilistic classification to classify data as fraudulent or normal. The credit card fraud detection (CCFD) dataset is applied to AE-PRF for performance evaluation and comparison. The CCFD dataset contains large numbers of credit card transactions of European cardholders; it is highly imbalanced since its normal transactions far outnumber fraudulent transactions. Data resampling schemes like the synthetic minority oversampling technique (SMOTE), adaptive synthetic (ADASYN), and Tomek link (T-Link) are applied to the CCFD dataset to balance the numbers of normal and fraudulent transactions for improving AE-PRF performance. Experimental results show that the performance of AE-PRF does not vary much whether or not resampling schemes are applied to the dataset. This indicates that AE-PRF is naturally suitable for dealing with imbalanced datasets. When compared with related methods, AE-PRF has relatively excellent performance in terms of accuracy, the true positive rate, the true negative rate, the Matthews correlation coefficient, and the area under the receiver operating characteristic curve.
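
The sketch below illustrates the AE-PRF pipeline shape: an autoencoder compresses the transaction features, and a random forest classifies in the latent space using class probabilities. The layer sizes, training settings, and the 0.5 decision threshold are assumptions, not the paper's tuned configuration.

```python
# A minimal sketch of an autoencoder + probabilistic random forest pipeline.
import numpy as np
from tensorflow import keras
from sklearn.ensemble import RandomForestClassifier

def build_autoencoder(n_features, latent_dim=8):
    inp = keras.Input(shape=(n_features,))
    z = keras.layers.Dense(latent_dim, activation="relu")(inp)    # encoder
    out = keras.layers.Dense(n_features, activation="linear")(z)  # decoder
    ae = keras.Model(inp, out)
    ae.compile(optimizer="adam", loss="mse")
    return ae, keras.Model(inp, z)

def fit_ae_prf(X_train, y_train):
    ae, encoder = build_autoencoder(X_train.shape[1])
    ae.fit(X_train, X_train, epochs=20, batch_size=256, verbose=0)  # reconstruct inputs
    Z = encoder.predict(X_train, verbose=0)                         # low-dimensional features
    rf = RandomForestClassifier(n_estimators=100).fit(Z, y_train)
    return encoder, rf

def predict_fraud(encoder, rf, X, threshold=0.5):
    p = rf.predict_proba(encoder.predict(X, verbose=0))[:, 1]  # fraud probability
    return (p >= threshold).astype(int)
```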


2018 ◽  
Vol 7 (4.15) ◽  
pp. 39 ◽  
Author(s):  
Utomo Pujianto

Data imbalance is one of the characteristics of software quality data sets that can have a negative effect on the performance of software defect prediction models. This study proposed an alternative to the random under-sampling strategy by using only the subset of non-defective data calculated as having the largest distance to the centroid of the defective data. Combined with random forest classification, the proposed method outperformed both the random under-sampling and non-sampling methods in terms of accuracy, AUC, F-measure, and true positive rate.
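
A minimal sketch of this centroid-distance under-sampling is shown below: keep only the non-defective samples farthest from the centroid of the defective class, then train a random forest. The 1:1 class ratio and the Euclidean distance are assumptions.

```python
# Sketch of centroid-distance under-sampling; labels assumed 0/1 with
# defect_label marking the defective (minority) class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def centroid_distance_undersample(X, y, defect_label=1):
    X_def = X[y == defect_label]
    X_clean = X[y != defect_label]
    centroid = X_def.mean(axis=0)                    # centroid of defective data
    d = np.linalg.norm(X_clean - centroid, axis=1)   # Euclidean distance per sample
    keep = np.argsort(d)[-len(X_def):]               # farthest non-defective samples
    X_bal = np.vstack([X_def, X_clean[keep]])
    y_bal = np.array([defect_label] * len(X_def) + [1 - defect_label] * len(keep))
    return X_bal, y_bal

# X_bal, y_bal = centroid_distance_undersample(X, y)
# model = RandomForestClassifier(n_estimators=100).fit(X_bal, y_bal)
```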


Author(s):  
Alifia Puspaningrum ◽  
Nahya Nur ◽  
Ozzy Secio Riza ◽  
Agus Zainal Arifin

Automatic classification of tuna images needs good segmentation as a main process. Tuna images are taken against a textured background with the tuna's shadow behind the object. This paper proposes a new weighted thresholding method for tuna image segmentation that adapts hierarchical cluster analysis (HCA) and the percentile method. The proposed method considers both the whole image and selected parts of it, which are used to estimate the object whose proportion is known. To detect the edges of tuna images, a 2-D Gabor filter is applied to the image. The resulting image is then thresholded with a value calculated using HCA and the percentile method, and mathematical morphology operations are applied to the thresholded image. In the experiments, the proposed method improved accuracy by up to 20.04%, sensitivity by up to 29.94%, and specificity by up to 17.23% compared to HCA. The results show that the proposed method can segment tuna images well and more accurately than the hierarchical cluster analysis method alone.
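
The following is a speculative reconstruction of such a pipeline, under stated assumptions: the Gabor frequency, the weight w, the percentile q, and the structuring element are placeholders, not the paper's tuned values.

```python
# Hedged sketch: Gabor edge response, threshold as a weighted mix of a
# cluster-based estimate and a percentile estimate, then morphological
# clean-up. All parameters are illustrative assumptions.
import numpy as np
from skimage.filters import gabor
from skimage.morphology import binary_closing, binary_opening, disk
from sklearn.cluster import AgglomerativeClustering

def segment_tuna(image, w=0.5, q=75):
    real, _ = gabor(image, frequency=0.6)              # 2-D Gabor edge response
    vals = real.reshape(-1, 1)
    n = min(2000, len(vals))                           # subsample pixels for clustering
    sample = vals[np.random.default_rng(0).choice(len(vals), n, replace=False)]
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(sample)
    t_hca = (sample[labels == 0].mean() + sample[labels == 1].mean()) / 2
    t_pct = np.percentile(vals, q)                     # percentile threshold
    t = w * t_hca + (1 - w) * t_pct                    # weighted combination
    mask = real > t
    return binary_closing(binary_opening(mask, disk(3)), disk(3))  # clean-up
```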


Energies ◽  
2021 ◽  
Vol 14 (4) ◽  
pp. 1028
Author(s):  
Silvia Corigliano ◽  
Federico Rosato ◽  
Carla Ortiz Dominguez ◽  
Marco Merlo

The scientific community is active in developing new models and methods to help reach the ambitious target set by UN SDG 7: universal access to electricity by 2030. Efficient planning of distribution networks is a complex and multivariate task, which is usually split into multiple subproblems to reduce the number of variables. The present work addresses the problem of optimal secondary substation siting by means of different clustering techniques. In contrast with the majority of approaches found in the literature, which are devoted to the planning of MV grids in already electrified urban areas, this work focuses on greenfield planning in rural areas. The K-means algorithm, hierarchical agglomerative clustering, and a method based on optimal weighted tree partitioning are adapted to the problem and run on two real case studies with different population densities. The algorithms are compared in terms of different indicators useful to assess the feasibility of the solutions found. The algorithms have proven effective in addressing some of the crucial aspects of substation siting and constitute relevant improvements over the classic K-means approach found in the literature. However, it is found to be very challenging to reconcile an acceptable geographical span of the area served by a single substation with a substation power high enough to justify the installation when the load density is very low. In other words, well-known standards adopted in industrialized countries do not fit developing countries' requirements.
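
For orientation, the sketch below shows the classic K-means baseline that the paper improves upon: clustering household coordinates with per-point electrical load as the sample weight, so candidate substations gravitate toward load centres. The load column, cluster count, and served-power check are assumptions.

```python
# Illustrative load-weighted K-means baseline for substation siting.
import numpy as np
from sklearn.cluster import KMeans

def site_substations(coords, load_kw, n_substations):
    """coords: (n, 2) projected x/y positions; load_kw: demand per point."""
    km = KMeans(n_clusters=n_substations, n_init=10, random_state=0)
    km.fit(coords, sample_weight=load_kw)            # load-weighted centroids
    sites = km.cluster_centers_                      # candidate substation sites
    served_kw = np.bincount(km.labels_, weights=load_kw, minlength=n_substations)
    return sites, served_kw   # compare served_kw against a minimum viable power
```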


2020 ◽  
pp. 1-12
Author(s):  
Ayla Gülcü ◽  
Sedrettin Çalişkan

The collateral mechanism in the Electricity Market ensures that payments are executed in a timely manner, thus maintaining continuous cash flow. To value collaterals, Takasbank, the authorized central settlement bank, segments the market participants by considering their short-term and long-term debt/credit information arising from all market activities. In this study, data on participants' daily and monthly debt payment and penalty behavior is analyzed with the aim of discovering high-risk participants that frequently fail to clear their debts on time. Different clustering techniques along with different distance metrics are considered to obtain the best clustering. Moreover, data preprocessing techniques along with Recency, Frequency, Monetary Value (RFM) scoring have been used to determine the best representation of the data. The results show that agglomerative clustering with cosine distance achieves the best-separated clustering when the non-normalized dataset is used; this is also acknowledged by a domain expert.
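
A hedged sketch of RFM scoring followed by cosine-distance agglomerative clustering is shown below; the column names, quintile scoring, and cluster count are assumptions about the (confidential) Takasbank data, not the study's exact setup.

```python
# Illustrative RFM scoring + agglomerative clustering with cosine distance.
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

def rfm_cluster(df, n_clusters=4):
    """df columns assumed: participant, pay_date (datetime), debt_amount."""
    now = df["pay_date"].max()
    rfm = df.groupby("participant").agg(
        recency=("pay_date", lambda s: (now - s.max()).days),
        frequency=("pay_date", "count"),
        monetary=("debt_amount", "sum"),
    )
    scores = pd.DataFrame({                      # quintile scores per dimension
        "R": pd.qcut(rfm["recency"], 5, labels=False, duplicates="drop"),
        "F": pd.qcut(rfm["frequency"], 5, labels=False, duplicates="drop"),
        "M": pd.qcut(rfm["monetary"], 5, labels=False, duplicates="drop"),
    }, index=rfm.index)
    scores["R"] = scores["R"].max() - scores["R"]   # recent activity scores high
    X = scores.values + 1.0                         # avoid zero vectors under cosine
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="cosine",   # "affinity" in older scikit-learn
                                    linkage="average")
    rfm["cluster"] = model.fit_predict(X)
    return rfm
```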


Author(s):  
Antonio Quintero Rincón ◽  
Hadj Batatia ◽  
Jorge Prende ◽  
Valeria Muro ◽  
Carlos D'Giano

Spike-and-wave discharge (SWD) pattern detection in electroencephalography (EEG) signals is a key signal processing problem. It is particularly important for overcoming the time-consuming, difficult, and error-prone manual analysis of long-term EEG recordings. This paper presents a new SWD detection method with low computational complexity that can be easily trained with data from standard medical protocols. Specifically, EEG signals are divided into time segments to which the Morlet 1-D decomposition is applied. The generalized Gaussian distribution (GGD) statistical model is fitted to the resulting wavelet coefficients. A k-nearest neighbors (k-NN) self-supervised classifier is trained using the GGD parameters to detect the spike-and-wave pattern. Experiments were conducted using 106 spike-and-wave signals and 106 non-spike-and-wave signals for training, and another 96 annotated EEG segments from six human subjects for testing. The proposed SWD classification methodology achieved 95% sensitivity (true positive rate), 87% specificity (true negative rate), and 92% accuracy. These results pave the way for new research into the causes underlying so-called absence epilepsy in long-term EEG recordings.
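
A simplified sketch of the feature pipeline follows: a Morlet wavelet decomposition of each EEG segment, a generalized Gaussian fit (scipy's gennorm) over the coefficients, and a plain supervised k-NN over the fitted parameters. The wavelet scales and k are assumptions, and the paper's self-supervised training detail is not reproduced here.

```python
# Hedged sketch: Morlet wavelet -> GGD parameters -> k-NN classifier.
import numpy as np
import pywt
from scipy.stats import gennorm
from sklearn.neighbors import KNeighborsClassifier

def ggd_features(segment, scales=np.arange(1, 32)):
    coefs, _ = pywt.cwt(segment, scales, "morl")     # Morlet 1-D decomposition
    beta, loc, scale = gennorm.fit(coefs.ravel())    # GGD shape/location/scale
    return [beta, loc, scale]

def train_swd_classifier(segments, labels, k=3):
    X = np.array([ggd_features(s) for s in segments])
    return KNeighborsClassifier(n_neighbors=k).fit(X, labels)
```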


2021 ◽  
Vol 10 (1) ◽  
pp. 57
Author(s):  
Ms. K. Sudharani ◽  
Dr. N. K. Sakthivel

The Certificateless Public Key Cryptography (CL-PKC) scheme is a new standard that combines Identity (ID)-based cryptography and traditional PKC. It yields better security than the ID-based cryptography scheme without requiring digital certificates. In the CL-PKC scheme, as the Key Generation Center (KGC) generates a public key using a partial secret key, the need for authenticating the public key by a trusted third party is avoided. Due to this lack of authentication, the public key associated with the private key of a user may be replaced by anyone; therefore, the ciphertext cannot be decrypted accurately. To mitigate this issue, an Enhanced Certificateless Proxy Signature (E-CLPS) is proposed that offers a high security guarantee and requires minimal computational cost. In this work, the Hackman tool is used for detecting dictionary attacks in the cloud. From the experimental analysis, it is observed that the proposed E-CLPS scheme yields a better attack detection rate, true positive rate, and true negative rate, and fewer false positives and false negatives than the existing schemes.

