Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance

Md Manjurul Ahsan; M. A. Parvez Mahmud; Pritom Kumar Saha; Kishor Datta Gupta; Zahed Siddique

doi:10.3390/technologies9030052

Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance

Technologies; doi:10.3390/technologies9030052; 2021; Vol 9 (3); pp. 52

Author(s): Md Manjurul Ahsan; M. A. Parvez Mahmud; Pritom Kumar Saha; Kishor Datta Gupta; Zahed Siddique

Keyword(s): Machine Learning; Heart Disease; Missing Data; Feature Reduction; Machine Learning Algorithms; Mixed Data; Support Vector; Tree Classifier; Data Scaling; Scaling Methods

Heart disease, one of the main causes of high mortality worldwide, requires a sophisticated and expensive diagnostic process. Much recent literature has presented machine learning as an opportunity to diagnose heart disease patients efficiently. However, dataset challenges such as missing data, inconsistent data, and mixed data (containing inconsistent missing data in both numerical and categorical form) are often obstacles in medical diagnosis. These inconsistencies lead to a higher probability of misprediction and misleading results. Data preprocessing steps such as feature reduction, data conversion, and data scaling are employed to form a standard dataset; such measures play a crucial role in reducing inaccuracy in the final prediction. This paper evaluates eleven machine learning (ML) algorithms, namely Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naive Bayes (NB), Support Vector Machine (SVM), XGBoost (XGB), Random Forest Classifier (RF), Gradient Boost (GB), AdaBoost (AB), and Extra Tree Classifier (ET), together with six data scaling methods, namely Normalization (NR), Standscale (SS), MinMax (MM), MaxAbs (MA), Robust Scaler (RS), and Quantile Transformer (QT), on a dataset comprising information on patients with heart disease. The results show that CART, along with RS or QT, outperforms all other ML algorithms with 100% accuracy, 100% precision, 99% recall, and 100% F1 score. The study outcomes demonstrate that a model’s performance varies depending on the data scaling method.
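
The scaling methods named in this abstract differ mainly in which statistics they subtract and divide by. The following is a minimal sketch of four of them, assuming the usual textbook definitions and using only the Python standard library. The function names and the sample column are illustrative and are not taken from the paper; the abstract does not state which library or exact formulas the authors used. Normalization and the quantile transformer are left out for brevity.

```python
import statistics


def min_max_scale(values):
    """Map values linearly onto [0, 1] using the column minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def max_abs_scale(values):
    """Divide by the largest absolute value, so results lie in [-1, 1]."""
    biggest = max(abs(v) for v in values)
    return [v / biggest for v in values]


def robust_scale(values):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Hypothetical cholesterol readings with one outlier (illustrative data only).
cholesterol = [180.0, 200.0, 220.0, 240.0, 600.0]

print(min_max_scale(cholesterol))
print(robust_scale(cholesterol))
```

With min-max scaling the single outlier squeezes the four ordinary readings into a narrow band near 0, whereas the median and interquartile range used by robust scaling keep them spread out. This sketch does not guard against constant columns, where the divisor would be zero.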

Heart Disease Prediction using Machine Learning

International Journal of Recent Technology and Engineering - 2; doi:10.35940/ijrte.f9780.059120; 2020; Vol 9 (1); pp. 700-704

Keyword(s): Machine Learning; Heart Disease; Machine Learning Techniques; Support Vector; Disease Prediction; Nearest Neighbour; Decision Tree Classifier; Support Vector Classifier; Learning Techniques; Tree Classifier

This work derives methodologies to detect heart problems at an earlier stage and to inform patients so that they can improve their health. To address this problem, we use machine learning techniques to predict the incidence of heart disease at an early stage. We use certain parameters, such as age, sex, height, weight, case history, smoking, and alcohol consumption, and tests such as pressure, cholesterol, diabetes, ECG, and ECHO, for prediction. Many machine learning algorithms can be applied to this problem, including K-Nearest Neighbour, support vector classifier, decision tree classifier, logistic regression, and random forest classifier. Using these parameters and algorithms, we predict whether or not the patient has heart disease and recommend ways for the patient to improve his or her health.

Missing Value Imputation in Stature Estimation by Learning Algorithms Using Anthropometric Data: A Comparative Study

Applied Sciences; doi:10.3390/app10145020; 2020; Vol 10 (14); pp. 5020

Author(s): Youngdoo Son; Wonjoon Kim

Keyword(s): Machine Learning; Missing Data; Learning Algorithms; Personal Identification; Machine Learning Algorithms; Lower Limbs; Support Vector; Anthropometric Data; Stature Estimation; Imputation Methods

Estimating stature is essential in the process of personal identification. Because it is difficult to find human remains intact at crime scenes and disaster sites, for instance, methods are needed for estimating stature based on different body parts. For instance, the upper and lower limbs may vary depending on ancestry and sex, and it is of great importance to design adequate methodology for incorporating these in estimating stature. In addition, it is necessary to use machine learning rather than simple linear regression to improve the accuracy of stature estimation. In this study, the accuracy of statures estimated based on anthropometric data was compared using three imputation methods. In addition, by comparing the accuracy among linear and nonlinear classification methods, the best method was derived for estimating stature based on anthropometric data. For both sexes, multiple imputation was superior when the missing data ratio was low, and mean imputation performed well when the ratio was high. The support vector machine recorded the highest accuracy in all ratios of missing data. The findings of this study showed appropriate imputation methods for estimating stature with missing anthropometric data. In particular, the machine learning algorithms can be effectively used for estimating stature in humans.
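
Of the imputation methods this abstract mentions, mean imputation is the simplest: each missing entry is replaced by the average of the observed entries in the same column. A minimal sketch in plain Python follows; the function name, the use of `None` as the missing-value marker, and the sample measurements are all assumptions made for illustration and do not come from the study.

```python
import statistics


def mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in column]


# Hypothetical limb lengths in cm; None marks a missing measurement.
limb_cm = [44.0, None, 46.0, 48.0, None]
imputed = mean_impute(limb_cm)
print(imputed)
```

Because every gap receives the same value, mean imputation shrinks the variance of the column as the share of missing entries grows, which is one reason it is usually compared against alternatives such as multiple imputation.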

Heart Disease Prediction Using Machine Learning

International Journal of Advanced Research in Science, Communication and Technology; doi:10.48175/ijarsct-1131; 2021; pp. 267-276

Author(s): Baban. U. Rindhe; Nikita Ahire; Rupali Patil; Shweta Gagare; Manisha Darade

Keyword(s): Machine Learning; Data Mining; Heart Disease; Heart Diseases; Learning Algorithms; Machine Learning Algorithms; Machine Learning Techniques; Whole Body; Support Vector; Learning Techniques

Heart-related diseases, or cardiovascular diseases (CVDs), have been the main cause of a huge number of deaths worldwide over the last few decades and have emerged as the most life-threatening diseases, not only in India but in the whole world. There is therefore a need for a reliable, accurate, and feasible system to diagnose such diseases in time for proper treatment. Machine learning algorithms and techniques have been applied to various medical datasets to automate the analysis of large and complex data. Many researchers have recently been using machine learning techniques to help the health care industry and professionals diagnose heart-related diseases. The heart is the next major organ after the brain in terms of priority in the human body. It pumps blood and supplies it to all organs of the body. Predicting the occurrence of heart disease is significant work in the medical field. Data analytics is useful for making predictions from large amounts of information, and it helps medical centers predict various diseases. A huge amount of patient-related data is maintained on a monthly basis. The stored data can be a useful source for predicting the occurrence of future diseases. Data mining and machine learning techniques used to predict heart disease include Artificial Neural Network (ANN), Random Forest, and Support Vector Machine (SVM). Prediction and diagnosis of heart disease have become a challenge faced by doctors and hospitals both in India and abroad. To reduce the large number of deaths from heart disease, a quick and efficient detection technique needs to be developed. Data mining techniques and machine learning algorithms play a very important role in this area. Researchers are accelerating their work to develop software, with the help of machine learning algorithms, that can help doctors with both the prediction and the diagnosis of heart disease. The main objective of this research project is to predict the heart disease of a patient using machine learning algorithms.

Framework for Providing Security in Private Cloud using Machine Learning Techniques

International Journal of Engineering and Advanced Technology - Regular Issue; doi:10.35940/ijeat.f9121.109119; 2019; Vol 9 (1); pp. 7641-7645

Keyword(s): Machine Learning; Support Vector Machine; Learning Algorithms; Feature Reduction; Machine Learning Algorithms; Supervised Machine Learning; Machine Learning Techniques; Support Vector; Cyber Attack; Learning Techniques

Advances in cyber-attack technologies have ushered in various new attacks that are difficult to detect using traditional intrusion detection systems (IDS). Existing IDS are trained to detect known patterns, so newer attacks bypass them and go undetected. In this paper, a two-level framework is proposed that can be used to detect unknown new attacks using machine learning techniques. In the first level, the known classes of attacks are determined using supervised machine learning algorithms such as Support Vector Machine (SVM) and neural networks (NN). The second level uses unsupervised machine learning algorithms such as K-means. The experiments are carried out with four models on the NSL-KDD dataset in an OpenStack cloud environment. The model with Support Vector Machine for supervised machine learning, Gradual Feature Reduction (GFR) for feature selection, and K-means as the unsupervised algorithm provided the optimum efficiency of 94.56%.

Ensemble-Based Machine Learning for Predicting Sudden Human Fall Using Health Data

Mathematical Problems in Engineering; doi:10.1155/2021/8608630; 2021; Vol 2021; pp. 1-12

Author(s): Utkarsh Saxena; Soumen Moulik; Soumya Ranjan Nayak; Thomas Hanne; Diptendu Sinha Roy

Keyword(s): Machine Learning; Learning Algorithms; Machine Learning Algorithms; Majority Voting; Support Vector; Human Beings; Medical Terminology; Decision Tree Classifier; Tree Classifier; Health Parameters

We attempt to predict the accidental fall of human beings due to sudden abnormal changes in their health parameters such as blood pressure, heart rate, and sugar level. In medical terminology, this problem is known as Syncope. The primary motivation is to prevent such falls by predicting abnormal changes in these health parameters that might trigger a sudden fall. We apply various machine learning algorithms such as logistic regression, a decision tree classifier, a random forest classifier, K-Nearest Neighbours (KNN), a support vector machine, and a naive Bayes classifier on a relevant dataset and verify our results with the cross-validation method. We observe that the KNN algorithm provides the best accuracy in predicting such a fall. However, the accuracy results of some other algorithms are also very close. Thus, we move one step further and propose an ensemble model, Majority Voting, which aggregates the prediction results of multiple machine learning algorithms and finally indicates the probability of a fall that corresponds to a particular human being. The proposed ensemble algorithm yields 87.42% accuracy, which is greater than the accuracy provided by the KNN algorithm.
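
The aggregation step described in this abstract can be illustrated with a plain hard-voting sketch: each base model casts one vote and the most common label wins. The code below is an assumption-laden illustration, not the authors' implementation. The function name, the model names, and the sample votes are hypothetical, and reporting the winning label's vote share as a rough confidence is only one possible reading of the "probability of a fall" mentioned above.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label and the share of votes it received."""
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)


# Hypothetical per-model outputs for one person: 1 = fall predicted, 0 = no fall.
model_outputs = {
    "logistic_regression": 1,
    "decision_tree": 0,
    "knn": 1,
    "svm": 1,
    "naive_bayes": 0,
}
label, share = majority_vote(list(model_outputs.values()))
print(label, share)
```

Using an odd number of base models avoids ties in the binary case; with an even number, `Counter.most_common` returns whichever tied label was encountered first.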

Comparison of Machine Learning Algorithms for the Quality Assessment of Wearable ECG Signals Via Lenovo H3 Devices

Journal of Medical and Biological Engineering; doi:10.1007/s40846-020-00588-7; 2021

Author(s): Fan Fu; Wentao Xiang; Yukun An; Bin Liu; Xianqing Chen; ...

Keyword(s): Machine Learning; Heart Disease; Quality Assessment; Learning Algorithms; Wearable Devices; Machine Learning Algorithms; Assessment System; Support Vector; Signal Quality; ECG Signals

Purpose: Electrocardiogram (ECG) signals collected from wearable devices are easily corrupted with surrounding noise and artefacts, where the signal-to-noise ratio (SNR) of wearable ECG signals is significantly lower than that from hospital ECG machines. To meet the requirements for monitoring heart disease via wearable devices, eliminating useless or poor-quality ECG signals (e.g., lead-falls and low SNRs) can be solved by signal quality assessment algorithms. Methods: To compensate for the deficiency of the existing ECG quality assessment system, a wearable ECG signal dataset from heart disease patients collected by Lenovo H3 devices was constructed. Then, this paper compares the performance of three machine learning algorithms, i.e., the traditional support vector machine (SVM), least-squares SVM (LS-SVM) and long short-term memory (LSTM) algorithms. Different non-morphological signal quality indices (i.e., the approximate entropy (ApEn), sample entropy (SaEn), fuzzy measure entropy (FMEn), Hurst exponent (HE), kurtosis (K) and power spectral density (PSD) features) extracted from the original ECG signals are fed into the three algorithms as input. Results: The true positive rate, true negative rate, sensitivity and accuracy are used to evaluate the performance of each method, and the LSTM algorithm achieves the best results on these metrics (97.14%, 86.8%, 97.46% and 95.47%, respectively). Conclusions: Among the three algorithms, the LSTM-based quality assessment method is the most suitable for the signals collected by the Lenovo H3 devices. The results also show that the combination of statistical features can effectively evaluate the quality of ECG signals.

Prophecy on Programming Language using Machine Learning Algorithms

International Journal for Research in Applied Science and Engineering Technology; doi:10.22214/ijraset.2021.35746; 2021; Vol 9 (VI); pp. 3699-3706

Author(s): Komal Bhaskar Thube

Keyword(s): Machine Learning; Programming Language; Machine Learning Algorithms; Support Vector; Computer Language; Decision Tree Classifier; Development Environment; Tree Classifier; Develop Software; Neighbor Classifier

A programming language is a computer language that developers use to write software programs, scripts, or other sets of instructions for computers to execute. It is difficult to determine which programming language is most widely used. In this work, I analyze and compare the classification results of various machine learning models to find out which programming language is most widely used by developers. I use Support Vector Machine (SVM), K neighbor classifier (KNN), and Decision Tree Classifier (CART) for the comparative study. The task is to analyze different data and classify them, evaluating each algorithm in terms of accuracy, precision, recall, and F1 score. The best accuracy, 94.29%, was obtained using SVM. These techniques are coded in Python and executed in Jupyter Notebook, the Scientific Python Development Environment. The experiments show that SVM performs best for predictive analysis and is a well-suited algorithm for predicting the most widely used programming language.
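
The four evaluation metrics named in this abstract all derive from the same counts of true positives, false positives, and false negatives. The sketch below computes them for the binary case using the standard definitions. The function name and the label vectors are made up for illustration; the study itself appears to be a multi-class problem, for which these quantities would typically be averaged per class, and the abstract does not say how that was done.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical ground-truth labels and classifier outputs (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The sketch assumes at least one predicted positive and one actual positive; otherwise precision or recall would divide by zero and would need an explicit convention.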

Heart Disease Prediction Using Machine Learning Algorithm

International Journal of Innovative Technology and Exploring Engineering - Special Issue; doi:10.35940/ijitee.j9340.0881019; 2019; Vol 8 (10); pp. 2603-2606

Keyword(s): Machine Learning; Heart Disease; Decision Tree; Learning Algorithm; Heart Diseases; Support Vector; Machine Learning Algorithm; Decision Tree Algorithm; Tree Algorithm; Tree Classifier

Heart disease is a common problem that can be very severe in old age and in people who do not have a healthy lifestyle. Regular check-ups and diagnosis, in addition to maintaining decent eating habits, can prevent it to some extent. In this paper, we have tried to implement the most sought-after and important machine learning algorithms to predict heart disease in a patient. The decision tree classifier is implemented based on the symptoms, which are the attributes required for prediction. Using the decision tree algorithm, we can identify the attributes that lead to a better prediction on the datasets. The decision tree algorithm solves the problem using a tree representation: each internal node of the tree represents an attribute, and each leaf node corresponds to a class label. The support vector machine algorithm classifies the datasets on the basis of a kernel and separates the data using a hyperplane. The main objective of this project is to try to reduce the number of occurrences of heart disease in patients.

Statistical Machine Learning Approaches to Liver Disease Prediction

Livers; doi:10.3390/livers1040023; 2021; Vol 1 (4); pp. 294-312

Author(s): Fahad Mostafa; Easin Hasan; Morgan Williamson; Hafiz Khan

Keyword(s): Machine Learning; Liver Disease; Missing Data; Health Professionals; Learning Algorithms; Machine Learning Algorithms; Support Vector; Accuracy Score; Statistical Machine Learning; Medical Diagnoses

Medical diagnoses have important implications for improving patient care, research, and policy. For a medical diagnosis, health professionals use different kinds of pathological methods to make decisions on medical reports in terms of the patients’ medical conditions. Recently, clinicians have been actively engaged in improving medical diagnoses. The use of artificial intelligence and machine learning in combination with clinical findings has further improved disease detection. In the modern era, with the advantage of computers and technologies, one can collect data and visualize many hidden outcomes such as dealing with missing data in medical research. Statistical machine learning algorithms based on specific problems can assist one to make decisions. Machine learning (ML), data-driven algorithms can be utilized to validate existing methods and help researchers to make potential new decisions. The purpose of this study was to extract significant predictors for liver disease from the medical analysis of 615 humans using ML algorithms. Data visualizations were implemented to reveal significant findings such as missing values. Multiple imputations by chained equations (MICEs) were applied to generate missing data points, and principal component analysis (PCA) was used to reduce the dimensionality. Variable importance ranking using the Gini index was implemented to verify significant predictors obtained from the PCA. Training data (n_train = 399) for learning and testing data (n_test = 216) in the ML methods were used for predicting classifications. The study compared binary classifier machine learning algorithms (i.e., artificial neural network, random forest (RF), and support vector machine), which were utilized on a published liver disease data set to classify individuals with liver diseases, which will allow health professionals to make a better diagnosis. The synthetic minority oversampling technique was applied to oversample the minority class to regulate overfitting problems. The RF significantly contributed (p < 0.001) to a higher accuracy score of 98.14% compared to the other methods. Thus, this suggests that ML methods predict liver disease by incorporating the risk factors, which may improve the inference-based diagnosis of patients.

A Hybrid Framework for Heart Disease Prediction Using Machine Learning Algorithms

E3S Web of Conferences; doi:10.1051/e3sconf/202130901043; 2021; Vol 309; pp. 01043

Author(s): L. Chandrika; K. Madhavi

Keyword(s): Machine Learning; Heart Disease; Random Forest; Prediction Model; Linear Model; Learning Algorithms; Machine Learning Algorithms; Support Vector; The Past; Past Data

Cardiovascular diseases (CVDs) are the primary cause of sudden death in the world today. Over the past few years the disease has emerged as a highly unpredictable problem, not only in India but across the whole planet. There is therefore a pressing need for a valid, accurate, and practical solution or application to diagnose CVD problems in time for the necessary treatment. Predicting CVD is a great challenge in the health care domain of clinical data analysis. Machine learning algorithms (MLA) and techniques have developed substantially and have proven effective and efficient at making predictions from past data. Various studies have applied these MLA techniques to clinical datasets provided by the healthcare industry, but they have covered only a small part of CVD prediction with ML algorithms. In this thesis, we propose a novel methodology that concentrates on finding appropriate features using MLA techniques, with the aim of finding an accurate model to predict CVD. In this prediction model, we implement models with different combinations of features and several known classification techniques: Deep Learning, Random Forest, Generalised Linear Model, Naïve Bayes, Logistic Regression, Decision Tree, Gradient Boosted trees, Support Vector Machine, Vote, and HRFLM. These achieved accuracy levels of 75.8%, 85.1%, 82.9%, 87.4%, 85%, 86.1%, 78.3%, 86.1%, 87.41%, and 88.4%, respectively, with the highest accuracy obtained by the prediction model for heart disease with the hybrid random forest with a linear model (HRFLM).
