SMOTE-ENC: A Novel SMOTE-Based Method to Generate Synthetic Data for Nominal and Continuous Features

Real-world datasets are often heavily skewed, with some classes significantly outnumbered by others. In these situations, machine learning algorithms struggle to predict the underrepresented instances accurately. To address this problem, many variations of the synthetic minority oversampling technique (SMOTE) have been proposed to balance datasets with continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based oversampling technique available. In this paper, we present a novel minority oversampling method, SMOTE-ENC (SMOTE-Encoded Nominal and Continuous), in which nominal features are encoded as numeric values and the difference between two such numeric values reflects the amount of change of association with the minority class. Our experiments show that classification models using SMOTE-ENC offer better prediction than models using SMOTE-NC when the dataset has a substantial number of nominal features and when there is some association between the categorical features and the target class. Additionally, our proposed method addresses one of the major limitations of the SMOTE-NC algorithm: SMOTE-NC can be applied only to mixed datasets containing both continuous and nominal features, and cannot function if all the features are nominal. Our method has been generalized so that it applies to both mixed and nominal-only datasets.
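
To make the encoding idea concrete, the sketch below gives each nominal label a number based on how much more (or less) often it occurs in the minority class than the overall class ratio would predict. The score used here (observed minus expected, divided by expected) and the helper name `encode_nominal` are assumptions for illustration only; the abstract does not state the exact formula.

```python
from collections import Counter

def encode_nominal(values, labels, minority_label):
    """Map each nominal category to a score of its association with the minority class.

    Assumed score: (observed - expected) / expected, where `expected` is the
    minority count the category would have if it followed the overall class ratio.
    """
    total = len(values)
    minority_total = sum(1 for y in labels if y == minority_label)
    if minority_total == 0:
        raise ValueError("no minority samples to encode against")
    minority_ratio = minority_total / total

    category_counts = Counter(values)
    minority_counts = Counter(v for v, y in zip(values, labels) if y == minority_label)

    encoding = {}
    for category, count in category_counts.items():
        expected = count * minority_ratio          # minority count under no association
        observed = minority_counts.get(category, 0)
        encoding[category] = (observed - expected) / expected
    return encoding

# Toy data: "red" appears only with the minority class (label 1).
colors = ["red", "red", "blue", "blue", "blue", "green", "green", "green"]
target = [1, 1, 0, 0, 0, 0, 0, 0]
codes = encode_nominal(colors, target, minority_label=1)
```

With an encoding of this kind, two categories that are similarly associated with the minority class end up numerically close, so a distance-based neighbour search can treat them as similar.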

Imbalance class problems in data mining: a review

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v14.i3.pp1552-1563 ◽

2019 ◽

Vol 14 (3) ◽

pp. 1552 ◽

Cited By ~ 3

Author(s):

Haseeb Ali ◽

Mohd Najib Mohd Salleh ◽

Rohmat Saedudin ◽

Kashif Hussain ◽

Muhammad Faheem Mushtaq

Keyword(s):

Machine Learning ◽

Data Mining ◽

Imbalanced Data ◽

Machine Learning Algorithms ◽

The Other ◽

Minority Class ◽

Future Directions ◽

Significant Research ◽

Comprehensive Survey ◽

Imbalanced Class

Imbalanced data problems are common in data mining and arise from the skewed nature of data. They negatively affect classification in machine learning. In such problems, the classes contain very different numbers of specimens: a large number belong to one class, while the other class, which is usually the more important one, has far fewer and is misclassified by many classifiers. Significant research has been performed to address imbalanced data problems using different techniques and approaches. In this research, a comprehensive survey is performed to identify the challenges of handling imbalanced class problems during classification with machine learning algorithms. We discuss the issues of classifiers that are biased toward the majority class and ignore the minority class. Furthermore, viable solutions and potential future directions for handling these problems are provided.

Machine-Learning-Based Radiomics MRI Model for Survival Prediction of Recurrent Glioblastomas Treated with Bevacizumab

Diagnostics ◽

10.3390/diagnostics11071263 ◽

2021 ◽

Vol 11 (7) ◽

pp. 1263

Author(s):

Samy Ammari ◽

Raoul Sallé de Chou ◽

Tarek Assi ◽

Mehdi Touat ◽

Emilie Chouzenoux ◽

...

Keyword(s):

Machine Learning ◽

Therapeutic Option ◽

Binary Classification ◽

Progression Free Survival ◽

Recurrent Glioblastoma ◽

Machine Learning Algorithms ◽

Survival Prediction ◽

Classification Models ◽

Angiogenic Therapy ◽

Recurrent Gbm

Anti-angiogenic therapy with bevacizumab is a widely used therapeutic option for recurrent glioblastoma (GBM). Nevertheless, the therapeutic response remains highly heterogeneous among GBM patients with discordant outcomes. Recent data have shown that radiomics, an advanced recent imaging analysis method, can help to predict both prognosis and therapy in a multitude of solid tumours. The objective of this study was to identify novel biomarkers, extracted from MRI and clinical data, which could predict overall survival (OS) and progression-free survival (PFS) in GBM patients treated with bevacizumab using machine-learning algorithms. In a cohort of 194 recurrent GBM patients (age range 18–80), radiomics data from pre-treatment T2 FLAIR and gadolinium-injected MRI images along with clinical features were analysed. Binary classification models for OS at 9, 12, and 15 months were evaluated. Our classification models successfully stratified the OS. The AUCs were equal to 0.78, 0.85, and 0.76 on the test sets (0.79, 0.82, and 0.87 on the training sets) for the 9-, 12-, and 15-month endpoints, respectively. Regressions yielded a C-index of 0.64 (0.74) for OS and 0.57 (0.69) for PFS. These results suggest that radiomics could assist in the elaboration of a predictive model for treatment selection in recurrent GBM patients.
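
As a reminder of what the reported AUC values measure, the sketch below computes AUC from its pairwise definition: the fraction of (positive, negative) pairs in which the positive case receives the higher score. The function name `auc_score` and the toy labels and scores are hypothetical and are not data from the study.

```python
def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Toy example: 1 = survived past the endpoint, 0 = did not (made-up values).
y = [1, 1, 0, 0, 1, 0]
s = [0.9, 0.6, 0.7, 0.2, 0.8, 0.1]
auc = auc_score(y, s)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the 0.76 to 0.85 test-set values above should be read.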

Machine Learning Model of Dimensionless Numbers to Predict Flow Patterns and Droplet Characteristics for Two-Phase Digital Flows

Applied Sciences ◽

10.3390/app11094251 ◽

2021 ◽

Vol 11 (9) ◽

pp. 4251

Author(s):

Jinsong Zhang ◽

Shuai Zhang ◽

Jianhua Zhang ◽

Zhiliang Wang

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Digital Microfluidics ◽

Flow Patterns ◽

Machine Learning Algorithms ◽

Dimensionless Numbers ◽

Two Phase ◽

The Difference ◽

Input Variables ◽

Digital Microfluidic

In digital microfluidic experiments, droplet characteristics and flow patterns are generally identified and predicted by empirical methods, which are poorly suited to processing large amounts of data. In addition, because human intervention is inevitable, inconsistent judgment standards make comparison between different experiments cumbersome and almost impossible. In this paper, we used machine learning to build algorithms that could automatically identify, judge, and predict flow patterns and droplet characteristics, turning empirical judgment into an intelligent process. Unlike the usual machine learning approaches, a generalized variable system was introduced to describe the different geometry configurations of the digital microfluidics. Specifically, Buckingham's theorem was adopted to obtain multiple groups of dimensionless numbers as the input variables of the machine learning algorithms. In verification, the SVM and BPNN algorithms successfully classified and predicted the different flow patterns and droplet characteristics (length and frequency). Compared with the primitive parameter system, the dimensionless number system was superior in predictive capability. The dimensionless numbers selected for the machine learning algorithms should have strong physical meaning rather than purely mathematical meaning. Applying the dimensionless numbers reduced the dimensionality of the system and the amount of computation without losing the information in the primitive parameters.
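
The sketch below illustrates how raw physical parameters can be collapsed into dimensionless groups before being passed to a classifier. The abstract does not say which groups were used, so the Reynolds, capillary, and Weber numbers here are an assumed, standard choice for two-phase flows, and the fluid properties are illustrative values only.

```python
def dimensionless_groups(density, velocity, length, viscosity, surface_tension):
    """Turn five dimensional parameters (SI units) into three dimensionless numbers."""
    reynolds = density * velocity * length / viscosity              # inertia / viscous forces
    capillary = viscosity * velocity / surface_tension              # viscous / interfacial forces
    weber = density * velocity ** 2 * length / surface_tension      # inertia / interfacial forces
    return {"Re": reynolds, "Ca": capillary, "We": weber}

# Illustrative water-like fluid in a 100-micrometre channel.
features = dimensionless_groups(
    density=1000.0,        # kg/m^3
    velocity=0.01,         # m/s
    length=1e-4,           # m
    viscosity=1e-3,        # Pa*s
    surface_tension=0.072, # N/m
)
```

Five inputs become three, and since We = Re × Ca only two of them are independent, which is the kind of dimensionality reduction the abstract describes.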

Multivariate Analysis for the Classification of Chocolate According to its Percentage of Cocoa by Using Terahertz Time-Domain Spectroscopy (THz-TDS)

Proceedings ◽

10.3390/foods_2020-08029 ◽

2020 ◽

Vol 70 (1) ◽

pp. 109

Author(s):

Jimy Oblitas ◽

Jorge Ruiz

Keyword(s):

Machine Learning ◽

Time Domain ◽

Electromagnetic Pulse ◽

Machine Learning Algorithms ◽

Classification Models ◽

Terahertz Time Domain Spectroscopy ◽

Time Domain Spectroscopy ◽

Svm Algorithm ◽

Classification Of Images

Terahertz time-domain spectroscopy is a useful technique for determining some physical characteristics of materials, and is based on selective frequency absorption of a broad-spectrum electromagnetic pulse. In order to investigate the potential of this technology to classify cocoa percentages in chocolates, the terahertz spectra (0.5–10 THz) of five chocolate samples (50%, 60%, 70%, 80% and 90% of cocoa) were examined. The acquired data matrices were analyzed with the MATLAB 2019b application, from which the dielectric function was obtained along with the absorbance curves, and were classified by using 24 mathematical classification models, achieving differentiations of around 93% obtained by the Gaussian SVM algorithm model with a kernel scale of 0.35 and a one-against-one multiclass method. It was concluded that the combined processing and classification of images obtained from the terahertz time-domain spectroscopy and the use of machine learning algorithms can be used to successfully classify chocolates with different percentages of cocoa.
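
For readers unfamiliar with the one-against-one scheme mentioned above, the sketch below lists the binary sub-problems it creates for the five cocoa levels and shows a Gaussian kernel with a kernel scale. The study used MATLAB; this Python version is only an illustration, and the kernel form (inputs divided by the scale before the squared distance is taken) is an assumption about the convention rather than something stated in the abstract.

```python
import math
from itertools import combinations

def one_vs_one_pairs(classes):
    """List every unordered pair of classes; one binary classifier is trained per pair."""
    return list(combinations(classes, 2))

def gaussian_kernel(x, y, kernel_scale):
    """Gaussian (RBF) kernel with inputs divided by kernel_scale (assumed convention)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / kernel_scale ** 2)

cocoa_levels = ["50%", "60%", "70%", "80%", "90%"]
pairs = one_vs_one_pairs(cocoa_levels)

# Two identical feature vectors have similarity 1; similarity decays with distance.
same = gaussian_kernel([1.0, 2.0], [1.0, 2.0], kernel_scale=0.35)
apart = gaussian_kernel([0.0], [0.35], kernel_scale=0.35)
```

Five classes give 5 × 4 / 2 = 10 pairwise classifiers, and the predicted class is typically the one that wins the most pairwise votes.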

Data mining of coronavirus: SARS-CoV-2, SARS-CoV and MERS-CoV

BMC Research Notes ◽

10.1186/s13104-021-05561-4 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Jung Eun Huh ◽

Seunghee Han ◽

Taeseon Yoon

Keyword(s):

Machine Learning ◽

Amino Acid ◽

Amino Acid Sequence ◽

Decision Tree ◽

Machine Learning Algorithms ◽

High Similarity ◽

Incubation Periods ◽

Initial Question ◽

The Difference ◽

Blast Program

Objective: In this study we compare the amino acid and codon sequences of SARS-CoV-2, SARS-CoV and MERS-CoV using different statistics programs to understand their characteristics. Specifically, we are interested in how differences in the amino acid and codon sequences can lead to different incubation periods and outbreak periods. Our initial question was to compare SARS-CoV-2 to different viruses in the coronavirus family using the BLAST program of NCBI and machine learning algorithms. Results: Experiments using BLAST, Apriori and Decision Tree showed that SARS-CoV-2 had high similarity with SARS-CoV while having comparably low similarity with MERS-CoV. We decided to compare the codons of SARS-CoV-2 and MERS-CoV to see the difference. Though the viruses are very alike according to the BLAST and Apriori experiments, SVM showed that they can be effectively classified using non-linear kernels. The Decision Tree experiment revealed several remarkable properties of the SARS-CoV-2 amino acid sequence that cannot be found in the MERS-CoV amino acid sequence. The ultimate purpose of this paper is to minimize the damage to humanity from SARS-CoV-2. Hence, further studies can focus on comparing SARS-CoV-2 with other viruses that can also be transmitted during latent periods.

Image-to-Image Translation with Multi-Path Consistency Regularization

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/413 ◽

2019 ◽

Cited By ~ 2

Author(s):

Jianxin Lin ◽

Yingce Xia ◽

Yijun Wang ◽

Tao Qin ◽

Zhibo Chen

Keyword(s):

Machine Learning ◽

Computer Vision ◽

The Other ◽

Natural Images ◽

Original Image ◽

Target Domain ◽

Source Domain ◽

Face To Face ◽

Image Translation ◽

The Difference

Image translation across different domains has attracted much attention in both machine learning and computer vision communities. Taking the translation from a source domain to a target domain as an example, existing algorithms mainly rely on two kinds of loss for training: One is the discrimination loss, which is used to differentiate images generated by the models and natural images; the other is the reconstruction loss, which measures the difference between an original image and the reconstructed version. In this work, we introduce a new kind of loss, multi-path consistency loss, which evaluates the differences between direct translation from source domain to target domain and indirect translation from source domain to an auxiliary domain to target domain, to regularize training. For multi-domain translation (at least, three) which focuses on building translation models between any two domains, at each training iteration, we randomly select three domains, set them respectively as the source, auxiliary and target domains, build the multi-path consistency loss and optimize the network. For two-domain translation, we need to introduce an additional auxiliary domain and construct the multi-path consistency loss. We conduct various experiments to demonstrate the effectiveness of our proposed methods, including face-to-face translation, paint-to-photo translation, and de-raining/de-noising translation.
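
The multi-path consistency idea can be sketched as follows: translate an input directly from source to target, translate it again through an auxiliary domain, and penalize the difference between the two results. The mean absolute (L1) distance, the function names, and the toy "translators" below are all assumptions for illustration; the abstract does not specify the distance used, and the real models are neural networks operating on images.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized lists of pixel values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def multi_path_consistency_loss(x, direct, to_aux, aux_to_target):
    """Compare the one-hop path (source -> target) with the two-hop path
    (source -> auxiliary -> target) for the same input x."""
    one_hop = direct(x)
    two_hop = aux_to_target(to_aux(x))
    return l1_distance(one_hop, two_hop)

# Toy stand-ins for trained translators (hypothetical).
image = [0.0, 0.5, 1.0]
direct = lambda img: [2 * p for p in img]             # source -> target
to_aux = lambda img: [p + 1 for p in img]             # source -> auxiliary
consistent = lambda img: [2 * p - 2 for p in img]     # auxiliary -> target, agrees with direct
inconsistent = lambda img: [2 * p for p in img]       # auxiliary -> target, disagrees

loss_consistent = multi_path_consistency_loss(image, direct, to_aux, consistent)
loss_inconsistent = multi_path_consistency_loss(image, direct, to_aux, inconsistent)
```

When the two paths agree the loss is zero, so minimizing it during training pushes the three translators toward mutually consistent mappings.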

Construction of Rapid Early Warning and Comprehensive Analysis Models for Urban Waterlogging Based on AutoML and Comparison of the other three Machine Learning Algorithms

Journal of Hydrology ◽

10.1016/j.jhydrol.2021.127367 ◽

2021 ◽

pp. 127367

Author(s):

Yuchen Guo ◽

Lihong Quan ◽

Lili Song ◽

Hao Liang

Keyword(s):

Machine Learning ◽

Early Warning ◽

Learning Algorithms ◽

Comprehensive Analysis ◽

Machine Learning Algorithms ◽

The Other ◽

Analysis Models

A Literature Review on Thyroid Hormonal Problems in Women Using Data Science and Analytics

Advances in Data Mining and Database Management - Handbook of Research on Engineering, Business, and Healthcare Applications of Data Science and Analytics ◽

10.4018/978-1-7998-3053-5.ch021 ◽

2021 ◽

pp. 416-428

Author(s):

R. Suganya ◽

Rajaram S. ◽

Kameswari M.

Keyword(s):

Machine Learning ◽

Literature Review ◽

Data Science ◽

Learning Algorithms ◽

Research Literature ◽

Machine Learning Algorithms ◽

Thyroid Disorder ◽

Classification Models ◽

Indian Women ◽

Using Data

Thyroid disorders are currently common and widespread among women worldwide. In India, seven out of ten women suffer from thyroid problems. Various studies in the research literature indicate that about 35% of Indian women are found to have prevalent goiter. It is necessary to take preventive measures at an early stage; otherwise, the disorder can cause infertility among women. This review discusses various analytics models that are used to handle different types of thyroid problems in women. This chapter analyzes and compares different classification models, covering both machine learning and deep learning algorithms, for classifying different thyroid problems. Literature on both machine learning and deep learning algorithms is considered. This literature review will help to analyze the causes and characteristics of thyroid disorders. The dataset used to build and validate the algorithms was provided by the UCI Machine Learning Repository.

A novel method for detecting disk filtration attacks via the various machine learning algorithms

China Communications ◽

10.23919/jcc.2020.04.010 ◽

2020 ◽

Vol 17 (4) ◽

pp. 99-108

Author(s):

Weijun Zhu ◽

Mingliang Xu

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Novel Method

Empirical Comparison of Various Discretization Procedures

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001498000567 ◽

1998 ◽

Vol 12 (07) ◽

pp. 1017-1032 ◽

Cited By ~ 10

Author(s):

Petr Berka ◽

Ivan Bruha

Keyword(s):

Machine Learning ◽

Real World ◽

Learning Algorithm ◽

Machine Learning Algorithms ◽

The Other ◽

Machine Learning Algorithm ◽

Empirical Comparison ◽

Numerical Attributes ◽

Real World Problems ◽

Discretization Procedure

Genuinely symbolic machine learning (ML) algorithms are capable of processing symbolic, categorical data only. However, real-world problems, e.g. in medicine or finance, involve both symbolic and numerical attributes. Discretizing (categorizing) numerical attributes is therefore an important issue in ML, and quite a few discretization procedures exist in the field. This paper describes two newer algorithms for categorization (discretization) of numerical attributes. The first one is implemented in KEX (Knowledge EXplorer) as its preprocessing procedure. Its idea is to discretize the numerical attributes in such a way that the resulting categorization corresponds to the KEX knowledge acquisition algorithm. Since the categorization for KEX is done "off-line" before using the KEX machine learning algorithm, it can be used as a preprocessing step for other machine learning algorithms, too. The other discretization procedure is implemented in CN4, a large extension of the well-known CN2 machine learning algorithm. The range of numerical attributes is divided into intervals that may form a complex generated by the algorithm as a part of the class description. Experimental results compare the performance of KEX and CN4 on some well-known ML databases. To make the comparison more informative, we also used the discretization procedure of the MLC++ library. Other ML algorithms, such as ID3 and C4.5, were also run in our experiments. The results are then compared and discussed.
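
To show what discretizing a numerical attribute means in practice, the sketch below uses plain equal-width binning. This is a generic baseline chosen for illustration, not the KEX or CN4 procedure described above (both of which take class information into account), and the function name and sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each numeric value to one of n_bins intervals of equal width.

    Returns a list of bin indices in the range 0 .. n_bins - 1.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)          # all values identical: a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 25, 32, 47, 51, 64, 80]
age_bins = equal_width_bins(ages, 3)
```

After this step a symbolic learner can treat `age_bins` as an ordinary categorical attribute with three values.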
