Integrating Ensembling Schemes with Classification for Customer Group Prediction using Machine Learning

In the current marketing scenario, it is difficult for a company to earn high profit while satisfying its customers and increasing its turnover. To increase profit, organizations are struggling to find a method to analyze their marketing strategy and to understand customers' requirements. The main way to increase the profit of any organization is to manufacture only the needed goods, in limited quantity, based on customers' needs and dislikes. For this, organizations need to learn customer behavior and opinions regarding their products. This calls for the use of machine learning algorithms to predict and analyze customer behavior. Against this background, we have used the wine data set from the UCI Machine Learning Repository. The wine data set is analyzed to decide the dependent and independent variables. Dimensionality reduction is done by applying ensembling methods. Feature importance is extracted with various ensembling methods: AdaBoost regressor, AdaBoost classifier, Random Forest regressor, Extra Trees regressor and Gradient Boosting regressor. The extracted feature importance of the wine data set is fitted with a logistic regression classifier to analyze the performance of each ensembling method. The metrics used for performance analysis are accuracy, precision, recall, and F-score. Experimental results show that the feature importance obtained from the AdaBoost regressor, fitted with the logistic regression classifier, is the most effective, with an accuracy of 94%, precision of 0.95, recall of 0.94 and F-score of 0.94, compared to the other ensembling methods.
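
The four metrics named in this abstract can be sketched in plain Python. The abstract does not say how the per-class scores of the multi-class wine problem were averaged, so the support-weighted average used here is an assumption, and the function name `weighted_report` is hypothetical.

```python
from collections import Counter


def weighted_report(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall and F-score."""
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f_score = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        p_l = tp / (tp + fp) if tp + fp else 0.0
        r_l = tp / (tp + fn) if tp + fn else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if p_l + r_l else 0.0
        weight = support[label] / n  # assumption: weight each class by its support
        precision += weight * p_l
        recall += weight * r_l
        f_score += weight * f_l
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}
```

With support weighting, the weighted recall always equals the accuracy, which matches the identical accuracy and recall figures (94% and 0.94) reported above.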

Regressor Fitting Of Feature Importance For Customer Segment Prediction With Ensembling Schemes Using Machine Learning

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.f8255.088619 ◽

2019 ◽

Vol 8 (6) ◽

pp. 952-956 ◽

Cited By ~ 2

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Mean Squared Error ◽

Absolute Error ◽

Machine Learning Algorithms ◽

Manufacturing Companies ◽

Data Set ◽

Feature Importance ◽

Customer Segment ◽

Feature Scaling

Predicting client behavior and feedback remains a challenging task for manufacturing companies. Companies struggle to increase their profit and annual turnover because they cannot predict customer likes and dislikes accurately. This motivates the use of machine learning algorithms to predict customer demand. This paper attempts to identify the important features of the wine data set, taken from the UCI Machine Learning Repository, for the prediction of customer segment. The important features are extracted with various ensembling methods: AdaBoost regressor, AdaBoost classifier, Random Forest regressor, Extra Trees regressor and Gradient Boosting regressor. The extracted feature importance of each ensembling method is then fitted with logistic regression to analyze the performance. The same extracted feature importance of each ensembling method is also subjected to feature scaling and then fitted with logistic regression to analyze the performance. The performance analysis uses the metrics Mean Squared Error (MSE), Mean Absolute Error (MAE), R2 Score, Explained Variance Score (EVS) and Mean Squared Log Error (MSLE). Experimental results show that, after applying feature scaling, the feature importance extracted from the Extra Trees regressor is the most effective, with an MSE of 0.04, MAE of 0.03, R2 Score of 94%, EVS of 0.9 and MSLE of 0.01, compared to the other ensembling methods.
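
The five error metrics listed in this abstract have standard textbook definitions, sketched below in plain Python. The function name `regression_metrics` is hypothetical, and the sketch assumes non-negative targets (needed for MSLE) and a target vector that is not constant (needed for R2 and EVS).

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, R2, EVS and MSLE from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # must be > 0
    r2 = 1 - sum(e * e for e in errors) / ss_tot
    mean_err = sum(errors) / n
    # EVS = 1 - Var(error) / Var(y); it differs from R2 only when the mean error is non-zero
    evs = 1 - sum((e - mean_err) ** 2 for e in errors) / ss_tot
    msle = sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / n
    return {"mse": mse, "mae": mae, "r2": r2, "evs": evs, "msle": msle}
```

For example, `regression_metrics([1, 2, 3], [1, 2, 2])` gives an R2 of 0.5 but an EVS of about 0.67, because the single error shifts the mean of the residuals.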

Ensembling Coalesce of Logistic Regression Classifier for Heart Disease Prediction using Machine Learning

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.l3473.1081219 ◽

2019 ◽

Vol 8 (12) ◽

pp. 127-133

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Heart Disease ◽

Heart Diseases ◽

Experimental Results ◽

Disease Prediction ◽

Feature Importance ◽

The World ◽

Feature Scaling ◽

Logistic Regression Classifier

In today’s world, the population is widely affected by some kind of heart disease. Despite vast knowledge and advances in applications, the analysis and identification of heart disease remain challenging. Due to the lack of awareness of patient symptoms, the prediction of heart disease is an uncertain task. The World Health Organization has reported that 33% of the population died due to heart disease. With this background, we have used the Heart Disease Prediction dataset from the UCI Machine Learning Repository to analyze and predict heart disease by integrating ensembling methods. The prediction of heart disease classes is carried out in four steps. First, the important features are extracted with various ensembling methods: Extra Trees regressor, AdaBoost regressor, Gradient Boosting regressor, Random Forest regressor and AdaBoost classifier. Second, the most important features of each ensembling method are filtered from the dataset and fitted to a logistic regression classifier to analyze the performance. Third, the same extracted important features of each ensembling method are subjected to feature scaling and then fitted with logistic regression to analyze the performance. Fourth, the performance analysis is done with the metrics Mean Squared Error (MSE), Mean Absolute Error (MAE), R2 Score, Explained Variance Score (EVS) and Mean Squared Log Error (MSLE). The implementation is done in Python on the Spyder platform with Anaconda Navigator. Experimental results show that, before applying feature scaling, the feature importance extracted from the AdaBoost classifier is the most effective, with an MSE of 0.04, MAE of 0.07, R2 Score of 92%, EVS of 0.86 and MSLE of 0.16, compared to the other ensembling methods. After applying feature scaling, the feature importance extracted from the AdaBoost classifier is again the most effective, with an MSE of 0.09, MAE of 0.13, R2 Score of 91%, EVS of 0.93 and MSLE of 0.18, compared to the other ensembling methods.
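
The abstract does not say which kind of feature scaling was applied. As a minimal sketch, assuming z-score standardization (zero mean and unit variance per column), the step could look like this; the function name `standardize` is hypothetical.

```python
import math


def standardize(rows):
    """Scale every column of a row-major table to zero mean and unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # population standard deviation of each column
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
            for c, m in zip(cols, means)]
    # a constant column has std 0; map it to 0.0 instead of dividing by zero
    return [[(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]
            for row in rows]
```

Scaling matters for a logistic regression fitted by gradient-based solvers, because features on very different ranges otherwise dominate the updates.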

Evaluation of Prognosis in Nasopharyngeal Cancer Using Machine Learning

Technology in Cancer Research & Treatment ◽

10.1177/1533033820909829 ◽

2020 ◽

Vol 19 ◽

pp. 153303382090982

Author(s):

Melek Akcay ◽

Durmus Etiz ◽

Ozer Celik ◽

Alaattin Ozen

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Naive Bayes ◽

Nasopharyngeal Cancer ◽

Naïve Bayes ◽

Machine Learning Algorithms ◽

Support Vector ◽

Tumor Diameter ◽

Survival Prognosis ◽

Data Set

Background and Aim: Although the prognosis of nasopharyngeal cancer largely depends on a classification based on the tumor-lymph node metastasis staging system, patients at the same stage may have different clinical outcomes. This study aimed to evaluate the survival prognosis of nasopharyngeal cancer using machine learning. Settings and Design: Original, retrospective. Materials and Methods: A total of 72 patients with a diagnosis of nasopharyngeal cancer who received radiotherapy ± chemotherapy were included in the study. The contribution of patient, tumor, and treatment characteristics to the survival prognosis was evaluated by machine learning using the following techniques: logistic regression, artificial neural network, XGBoost, support-vector clustering, random forest, and Gaussian Naive Bayes. Results: Correlation analysis and binary logistic regression analysis were applied to the data set. Of the 18 independent variables, 10 were found to be effective in predicting nasopharyngeal cancer-related mortality: age, weight loss, initial neutrophil/lymphocyte ratio, initial lactate dehydrogenase, initial hemoglobin, radiotherapy duration, tumor diameter, number of concurrent chemotherapy cycles, and T and N stages. Among the machine learning techniques, Gaussian Naive Bayes was determined to be the best algorithm for evaluating prognosis (accuracy rate: 88%, area under the curve score: 0.91, confidence interval: 0.68-1, sensitivity: 75%, specificity: 100%). Conclusion: Many factors affect prognosis in cancer, and machine learning algorithms can be used to determine which factors have a greater effect on survival prognosis, which then allows further research into these factors. In the current study, Gaussian Naive Bayes was identified as the best algorithm for the evaluation of prognosis of nasopharyngeal cancer.
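
Accuracy, sensitivity and specificity, as reported in this abstract, all derive from the four cells of a binary confusion matrix. The sketch below shows the arithmetic; the function name `binary_rates` is hypothetical, and the counts in the usage note are made up for illustration and are not taken from the study.

```python
def binary_rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of actual positives that were detected
    specificity = tn / (tn + fp)  # share of actual negatives that were cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy
```

For instance, `binary_rates(tp=9, fn=3, tn=10, fp=2)` yields a sensitivity of 0.75 and a specificity of about 0.83. A specificity of 100% means the classifier produced no false positives on the test set.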

Prediction on Domestic Violence in Bangladesh during the COVID-19 Outbreak Using Machine Learning Methods

Applied System Innovation ◽

10.3390/asi4040077 ◽

2021 ◽

Vol 4 (4) ◽

pp. 77

Author(s):

Md. Murad Hossain ◽

Md. Asadullah ◽

Abidur Rahaman ◽

Md. Sipon Miah ◽

M. Zahid Hasan ◽

...

Keyword(s):

Machine Learning ◽

Domestic Violence ◽

Logistic Regression ◽

Random Forest ◽

Family Violence ◽

Violence Against Women ◽

Machine Learning Algorithms ◽

Data Set ◽

Domestic Violence Against Women ◽

Women And Children

The COVID-19 outbreak resulted in preventative measures and restrictions for Bangladesh during the summer of 2020—these unstable and stressful times led to multiple social problems (e.g., domestic violence and divorce). Globally, researchers, policymakers, governments, and civil societies have been concerned about the increase in domestic violence against women and children during the ongoing COVID-19 pandemic. In Bangladesh, domestic violence against women and children has increased during the COVID-19 pandemic. In this article, we investigated family violence among 511 families during the COVID-19 outbreak. Participants were given questionnaires to answer, for a period of over ten days; we predicted family violence using a machine learning-based model. To predict domestic violence from our data set, we applied random forest, logistic regression, and Naive Bayes machine learning algorithms to our model. We employed an oversampling strategy named the Synthetic Minority Oversampling Technique (SMOTE) and the chi-squared statistical test to, respectively, solve the imbalance problem and discover the feature importance of our data set. The performances of the machine learning algorithms were evaluated based on accuracy, precision, recall, and F-score criteria. Finally, the receiver operating characteristic (ROC) and confusion matrices were developed and analyzed for three algorithms. On average, our model, with the random forest, logistic regression, and Naive Bayes algorithms, predicted family violence with 77%, 69%, and 62% accuracy for our data set. The findings of this study indicate that domestic violence has increased and is highly related to two features: family income level during the COVID-19 pandemic and education level of the family members.
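
The oversampling idea behind SMOTE can be sketched in a few lines: each synthetic minority sample is a random interpolation between an existing minority sample and one of its nearest minority neighbours. This is a simplified illustration and not the authors' code; the function name `smote_sketch`, its parameters, and the default `k` are assumptions.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # rank the other minority points by squared Euclidean distance to base
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Because every new point lies on a segment between two real minority samples, the synthetic data stays inside the region the minority class already occupies.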

Classification Comparative Analysis for Detection of Brain Tumor Using Neural Network, Logistic Regression & KNN Classifier with VGG19 Convolution Neural Network Feature Extraction

10.21467/proceedings.114.6 ◽

2021 ◽

Author(s):

Vijaya Kamble ◽

Rohin Daruwala

Keyword(s):

Neural Network ◽

Machine Learning ◽

Feature Extraction ◽

Logistic Regression ◽

Brain Tumor ◽

Medical Image Analysis ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Data Set ◽

Knn Classifier

In recent years, due to advancements in digital imaging, machine learning techniques have been used in medical image analysis for the prognosis and diagnosis of various abnormalities in the human body. Various machine learning algorithms, including convolutional and deep neural networks, are used for the classification, detection and prediction of brain tumors. The proposed approach is a comparative classification analysis based on three classifiers: a KNN classifier, logistic regression, and a neural network. It uses a deep learning feature extraction technique based on VGG19, a 19-layer image recognition model trained on ImageNet. Generally, MRI data sequences are analyzed in terms of different modalities, and every modality contains rich tissue information, so feature extraction from MRI sequences is a very important task for brain tumor classification. Our approach demonstrated fair classification on the BRATS 2018 benchmark data set with different modalities and image sizes; the results were obtained without any human annotations. All of the selected classifiers give accuracy above 90%, which compares well with other state-of-the-art methods.

Customer Segment Prognostic System by Machine Learning using Principal Component and Linear Discriminant Analysis

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b2290.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 6198-6203

Keyword(s):

Machine Learning ◽

Discriminant Analysis ◽

Dimensionality Reduction ◽

Linear Discriminant Analysis ◽

Principal Component ◽

Customer Behavior ◽

Machine Learning Algorithms ◽

Data Set ◽

Linear Discriminant ◽

Customer Group

Recently, the manufacturing industry has faced many problems in predicting customer behavior and customer groups so as to match its output with profit. Organizations find it difficult to identify customer behavior in order to guide product design and thereby increase profit. Predicting the customer group is a challenging task for every organization because of the growing number of entrepreneurs. This motivates the use of machine learning algorithms to cluster customers into groups and predict their demand, which supports decision making in manufacturing. This paper attempts to predict the customer group for the wine data set taken from the UCI Machine Learning Repository. The wine data set is subjected to dimensionality reduction with principal component analysis (PCA) and linear discriminant analysis (LDA). A performance analysis is done with various classification algorithms, and a comparative study uses the metrics accuracy, precision, recall, and F-score. Experimental results show that, after dimensionality reduction, the wine data set reduced to 2 LDA components and classified with the kernel SVM and Random Forest classifiers is the most effective, with an accuracy of 100%, compared to other classifiers.

Application of Convolutional Neural Network Algorithms for Advancing Sedentary and Activity Bout Classification

Journal for the Measurement of Physical Behaviour ◽

10.1123/jmpb.2020-0016 ◽

2020 ◽

pp. 1-9

Author(s):

Supun Nakandala ◽

Marta M. Jankowska ◽

Fatima Tuz-Zahra ◽

John Bellettiere ◽

Jordan A. Carlson ◽

...

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Logistic Regression ◽

Random Forest ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Feature Engineering ◽

Free Living ◽

Data Set

Background: Machine learning has been used for classification of physical behavior bouts from hip-worn accelerometers; however, this research has been limited due to the challenges of directly observing and coding human behavior “in the wild.” Deep learning algorithms, such as convolutional neural networks (CNNs), may offer better representation of data than other machine learning algorithms without the need for engineered features and may be better suited to dealing with free-living data. The purpose of this study was to develop a modeling pipeline for evaluation of a CNN model on a free-living data set and compare CNN inputs and results with the commonly used machine learning random forest and logistic regression algorithms. Method: Twenty-eight free-living women wore an ActiGraph GT3X+ accelerometer on their right hip for 7 days. A concurrently worn thigh-mounted activPAL device captured ground truth activity labels. The authors evaluated logistic regression, random forest, and CNN models for classifying sitting, standing, and stepping bouts. The authors also assessed the benefit of performing feature engineering for this task. Results: The CNN classifier performed best (average balanced accuracy for bout classification of sitting, standing, and stepping was 84%) compared with the other methods (56% for logistic regression and 76% for random forest), even without performing any feature engineering. Conclusion: Using the recent advancements in deep neural networks, the authors showed that a CNN model can outperform other methods even without feature engineering. This has important implications for both the model’s ability to deal with the complexity of free-living data and its potential transferability to new populations.
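
Balanced accuracy, the headline metric in this abstract, is commonly defined as the mean of the per-class recalls, so that a frequent class such as sitting cannot dominate the score. The sketch below uses that common definition; the authors' exact averaging over bouts may differ, and the function name `balanced_accuracy` is hypothetical.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```

A classifier that labels every bout as sitting on data that is 80% sitting reaches a plain accuracy of 0.8 but a balanced accuracy of only 0.5.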

Prediction of addiction to drugs and alcohol using machine learning: A case study on Bangladeshi population

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v11i5.pp4471-4480 ◽

2021 ◽

Vol 11 (5) ◽

pp. 4471

Author(s):

Md. Ariful Islam Arif ◽

Saiful Islam Sany ◽

Farah Sharmin ◽

Md. Sadekur Rahman ◽

Md. Tarek Habib

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Performance Metrics ◽

Learning Algorithms ◽

Principal Component ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Data Set ◽

Adaptive Boosting ◽

Drugs And Alcohol

Nowadays, addiction to drugs and alcohol has become a significant threat to the youth of society, including in Bangladesh. As conscientious members of society, we must act to protect these young minds from life-threatening addiction. In this paper, we take a machine-learning-based approach to forecasting the risk of becoming addicted to drugs. First, we identify significant factors for addiction by talking to doctors and drug-addicted people and by reading relevant articles and write-ups. Then we collect data from both addicted and non-addicted people. After preprocessing the data set, we apply nine prominent machine learning algorithms, namely k-nearest neighbors, logistic regression, SVM, naïve Bayes, classification and regression trees (CART), random forest, multilayer perceptron, adaptive boosting, and gradient boosting machine, to our processed data set and measure the performance of each classifier in terms of several performance metrics. Logistic regression outperforms all other classifiers on all metrics used, attaining an accuracy approaching 97.91%. In contrast, CART shows poor results, with an accuracy approaching 59.37% after applying principal component analysis.

AN EFFICIENT MACHINE LEARNING MODEL FOR PREDICTION OF ACUTE MYOCARDIAL INFARCTION

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813666200325104317 ◽

2020 ◽

Vol 13 ◽

Author(s):

Dhilsath Fathima.M ◽

S. Justin Samuel ◽

R. Hari Haran

Keyword(s):

Machine Learning ◽

Myocardial Infarction ◽

Acute Myocardial Infarction ◽

Logistic Regression ◽

Decision Tree ◽

Learning Model ◽

Training Dataset ◽

Data Set ◽

Machine Learning Model ◽

Proposed Model

Aim: This work develops an improved and robust machine learning model for predicting Myocardial Infarction (MI), which could have substantial clinical impact. Objectives: This paper explains how to build a machine learning based computer-aided analysis system for early and accurate prediction of MI, using the Framingham Heart Study dataset for validation and evaluation. The proposed computer-aided analysis model will support medical professionals in predicting myocardial infarction proficiently. Methods: The proposed model uses mean imputation to fill in the missing values in the data set, then applies principal component analysis (PCA) to extract the optimal features and enhance the performance of the classifiers. After PCA, the reduced features are partitioned into a training dataset and a testing dataset: 70% of the data is given as input to four popular classifiers (support vector machine, k-nearest neighbor, logistic regression and decision tree) for training, and the remaining 30% is used to evaluate the output of the machine learning model using the performance metrics confusion matrix, classifier accuracy, precision, sensitivity, F1-score, and AUC-ROC curve. Results: The outputs of the classifiers are evaluated using these performance measures. We observed that logistic regression provides higher accuracy than the k-NN, SVM, and decision tree classifiers, and that PCA performs well as a feature extraction method for enhancing the performance of the proposed model. From these analyses, we conclude that logistic regression has a good mean accuracy and standard deviation of accuracy compared with the other three algorithms. The AUC-ROC curves of the classifiers (Figures 4 and 5) show that logistic regression exhibits a good AUC-ROC score, around 70%, compared to the k-NN and decision tree algorithms.
Conclusion: From the result analysis, we infer that the proposed machine learning model will act as an optimal decision-making system for predicting acute myocardial infarction at an earlier stage than existing machine learning based prediction models. It can predict the presence of acute myocardial infarction in humans from heart disease risk factors, in order to decide when to start lifestyle modification and medical treatment to prevent heart disease.
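
Two of the preprocessing steps described in the Methods, mean imputation and the 70/30 partition, can be sketched with the standard library alone. The PCA step and the classifiers are omitted; the function names, the use of `None` to mark a missing value, and the fixed shuffle seed are assumptions made for illustration.

```python
import random


def mean_impute(rows):
    """Replace None entries with the mean of the observed values in that column."""
    means = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]


def split_70_30(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

For example, `mean_impute([[1.0, None], [3.0, 4.0], [None, 8.0]])` fills the gaps with the column means 2.0 and 6.0. Computing the means on the training part only, after the split, would avoid leaking test information into training.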

High performance logistic regression for privacy-preserving genome analysis

BMC Medical Genomics ◽

10.1186/s12920-020-00869-9 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Martine De Cock ◽

Rafael Dowsley ◽

Anderson C. A. Nascimento ◽

Davis Railsback ◽

Jianwei Shen ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Genome Analysis ◽

Local Area Network ◽

Local Area ◽

Activation Function ◽

Area Network ◽

Learning Models ◽

Data Set ◽

Machine Learning Models

Background: In biomedical applications, valuable data is often split between owners who cannot openly share the data because of privacy regulations and concerns. Training machine learning models on the joint data without violating privacy is a major technology challenge that can be addressed by combining techniques from machine learning and cryptography. When collaboratively training machine learning models with the cryptographic technique named secure multi-party computation, the price paid for keeping the data of the owners private is an increase in computational cost and runtime. A careful choice of machine learning techniques, algorithmic and implementation optimizations are a necessity to enable practical secure machine learning over distributed data sets. Such optimizations can be tailored to the kind of data and machine learning problem at hand. Methods: Our setup involves secure two-party computation protocols, along with a trusted initializer that distributes correlated randomness to the two computing parties. We use a gradient descent based algorithm for training a logistic regression like model with a clipped ReLU activation function, and we break down the algorithm into corresponding cryptographic protocols. Our main contributions are a new protocol for computing the activation function that requires neither secure comparison protocols nor Yao’s garbled circuits, and a series of cryptographic engineering optimizations to improve the performance. Results: For our largest gene expression data set, we train a model that requires over 7 billion secure multiplications; the training completes in about 26.90 s in a local area network. The implementation in this work is a further optimized version of the implementation with which we won first place in Track 4 of the iDASH 2019 secure genome analysis competition.
Conclusions: In this paper, we present a secure logistic regression training protocol and its implementation, with a new subprotocol to securely compute the activation function. To the best of our knowledge, we present the fastest existing secure multi-party computation implementation for training logistic regression models on high dimensional genome data distributed across a local area network.
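
To make the model in the Methods concrete, the following plaintext sketch trains a logistic-regression-like model by batch gradient descent with a clipped ReLU in place of the sigmoid. It contains no cryptography at all. The abstract does not give the exact form of the activation, so the piecewise-linear definition below (0 below -1/2, 1 above 1/2, linear in between) is an assumption, as are the function names, learning rate and epoch count.

```python
def clipped_relu(z):
    """Piecewise-linear stand-in for the sigmoid (assumed form)."""
    return min(1.0, max(0.0, z + 0.5))


def train(X, y, lr=0.1, epochs=200):
    """Batch gradient descent using the usual logistic-regression update rule."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            pred = clipped_relu(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = pred - yi  # same error term as in sigmoid-based training
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b
```

An activation built only from additions and comparisons against constants is cheaper to evaluate under secure computation than an exponential, which is the motivation the abstract gives for replacing the sigmoid.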
