A Comparative Analysis of Machine Learning Techniques for Cyberbullying Detection on Twitter

The advent of social media, particularly Twitter, raises many issues due to a misunderstanding regarding the concept of freedom of speech. One of these issues is cyberbullying, which is a critical global issue that affects both individual victims and societies. Many attempts have been introduced in the literature to intervene in, prevent, or mitigate cyberbullying; however, because these attempts rely on the victims’ interactions, they are not practical. Therefore, detection of cyberbullying without the involvement of the victims is necessary. In this study, we attempted to explore this issue by compiling a global dataset of 37,373 unique tweets from Twitter. Moreover, seven machine learning classifiers were used, namely, Logistic Regression (LR), Light Gradient Boosting Machine (LGBM), Stochastic Gradient Descent (SGD), Random Forest (RF), AdaBoost (ADB), Naive Bayes (NB), and Support Vector Machine (SVM). Each of these algorithms was evaluated using accuracy, precision, recall, and F1 score as the performance metrics to determine the classifiers’ recognition rates applied to the global dataset. The experimental results show the superiority of LR, which achieved a median accuracy of around 90.57%. Among the classifiers, logistic regression achieved the best F1 score (0.928), SGD achieved the best precision (0.968), and SVM achieved the best recall (1.00).
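The four metrics named in this abstract can all be derived from confusion-matrix counts. The following is a minimal sketch with hypothetical counts (none of the numbers come from the paper, and the function name is illustrative); it also shows why a recall of 1.00 on its own says little, since a classifier that flags every tweet reaches it trivially.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not taken from the paper).
balanced = classification_metrics(tp=80, fp=10, fn=20, tn=90)

# A classifier that labels every tweet as bullying: no false negatives,
# so recall is perfect, but precision and accuracy suffer.
always_positive = classification_metrics(tp=100, fp=100, fn=0, tn=0)

print(balanced)
print(always_positive)
```

Reading recall together with precision (or through F1) avoids rewarding the degenerate "always positive" behaviour.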

Predicting Forest Fires using Supervised and Ensemble Machine Learning Algorithms

International Journal of Recent Technology and Engineering - 2, doi:10.35940/ijrte.b2878.078219, 2019, Vol 8 (2), pp. 3697-3705, Cited By ~ 1

Keyword(s): Machine Learning, Logistic Regression, Forest Fires, Principal Component, Climatic Conditions, Machine Learning Algorithms, Machine Learning Techniques, Gradient Boosting, Support Vector, Physical Factors

Forest fires have become one of the most frequently occurring disasters in recent years. Their effects have a lasting impact on the environment, as they lead to deforestation and global warming, which is in turn one of the major causes of their occurrence. Forest fires are currently handled by collecting satellite images of forests and notifying the authorities when a fire causes an emergency, so that its effects can be mitigated. By the time the authorities learn of a fire, it has usually already caused a lot of damage. Data mining and machine learning techniques can provide an efficient prevention approach in which data associated with forests are used to predict the likelihood of forest fires. This paper uses the dataset in the UCI Machine Learning Repository, which consists of physical factors and climatic conditions of the Montesinho park in Portugal. Various algorithms, including Logistic Regression, Support Vector Machine, Random Forest, and K-Nearest Neighbors, in addition to Bagging and Boosting predictors, are used both with and without Principal Component Analysis (PCA). Among the models to which PCA was applied, Logistic Regression gave the highest F1 score of 68.26, and among the models without PCA, Gradient Boosting gave the highest score of 68.36.

Cardiac Disease Prediction using Supervised Machine Learning Techniques.

Journal of Physics Conference Series, doi:10.1088/1742-6596/2161/1/012013, 2022, Vol 2161 (1), pp. 012013

Author(s): Chiradeep Gupta, Athina Saha, N V Subba Reddy, U Dinesh Acharya

Keyword(s): Machine Learning, Logistic Regression, Cardiac Disease, Performance Metrics, Confusion Matrix, Supervised Machine Learning, Machine Learning Techniques, Support Vector, Ensemble Techniques, Learning Techniques

Diagnosis of cardiac disease needs to be accurate, precise, and reliable. The number of deaths due to cardiac attacks is increasing exponentially day by day. Thus, practical approaches for earlier diagnosis of cardiac or heart disease are needed to achieve prompt management of the disease. Various supervised machine learning techniques, namely K-Nearest Neighbour, Decision Tree, Logistic Regression, Naïve Bayes, and Support Vector Machine (SVM) models, are used to predict cardiac disease using a dataset collected from the repository of the University of California, Irvine (UCI). The results show that Logistic Regression performed better than all other supervised classifiers in terms of the performance metrics. According to the confusion matrices of all the models, it is also less risky, since its number of false negatives is low compared with the other models. In addition, ensemble techniques can be applied to improve the accuracy of the classifier. Jupyter Notebook is described as the best tool for implementing the work in Python, which offers many libraries for accurate and precise work.

Prediction of River Stage Using Multistep-Ahead Machine Learning Techniques for a Tidal River of Taiwan

Water, doi:10.3390/w13070920, 2021, Vol 13 (7), pp. 920

Author(s): Wen-Dar Guo, Wei-Bo Chen, Sen-Hai Yeh, Chih-Hsin Chang, Hongey Chen

Keyword(s): Machine Learning, Flood Control, Tidal River, Bayesian Optimization, Data Driven, Machine Learning Techniques, Gradient Boosting, Support Vector, River Stage, Light Gradient

Time-series prediction of a river stage during typhoons or storms is essential for flood control or flood disaster prevention. Data-driven models using machine learning (ML) techniques have become an attractive and effective approach to modeling and analyzing river stage dynamics. However, relatively new ML techniques, such as the light gradient boosting machine regression (LGBMR), have rarely been applied to predict the river stage in a tidal river. In this study, data-driven ML models were developed under a multistep-ahead prediction framework and evaluated for river stage modeling. Four ML techniques, namely support vector regression (SVR), random forest regression (RFR), multilayer perceptron regression (MLPR), and LGBMR, were employed to establish data-driven ML models with Bayesian optimization. The models were applied to simulate river stage hydrographs of the tidal reach of the Lan-Yang River Basin in Northeastern Taiwan. Historical measurements of rainfall, river stages, and tidal levels were collected from 2004 to 2017 and used for training and validation of the four models. Four scenarios were used to investigate the effect of the combinations of input variables on river stage predictions. The results indicated that (1) the tidal level at a previous stage significantly affected the prediction results; (2) the LGBMR model achieves more favorable prediction performance than the SVR, RFR, and MLPR models; and (3) the LGBMR model could efficiently and accurately predict the 1–6-h river stage in the tidal river. This study provides an extensive and insightful comparison of four data-driven ML models for river stage forecasting that can be helpful for model selection and flood mitigation.
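A multistep-ahead framework of the kind described here needs the time series reshaped into input windows and multi-hour targets before any regressor can be trained. The sketch below shows one common way to do that (direct multistep targets); the function name, the lag count, and the sample values are assumptions for illustration, and the abstract does not say which windowing scheme the authors actually used.

```python
def make_multistep_samples(series, n_lags, horizon):
    """Build (inputs, targets) pairs for multistep-ahead prediction.

    Each input holds `n_lags` consecutive observations; each target holds
    the `horizon` observations that immediately follow them.
    """
    samples = []
    last_start = len(series) - n_lags - horizon
    for start in range(last_start + 1):
        inputs = series[start:start + n_lags]
        targets = series[start + n_lags:start + n_lags + horizon]
        samples.append((inputs, targets))
    return samples


# Hypothetical hourly river-stage readings in metres (not from the paper).
stage = [1.0, 1.2, 1.5, 1.9, 2.4, 2.2, 1.8, 1.4]

# Use the last 3 hours to predict the next 2 hours.
samples = make_multistep_samples(stage, n_lags=3, horizon=2)
for inputs, targets in samples:
    print(inputs, "->", targets)
```

In a setting like the one in the abstract, rainfall and tidal-level lags would be appended to each input window, and `horizon` would be 6 to cover the 1–6-h range.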

Machine learning models to identify low adherence to influenza vaccination among Korean adults with cardiovascular disease

BMC Cardiovascular Disorders, doi:10.1186/s12872-021-01925-7, 2021, Vol 21 (1)

Author(s): Moojung Kim, Young Jae Kim, Sung Jin Park, Kwang Gi Kim, Pyung Chun Oh, ...

Keyword(s): Machine Learning, Cardiovascular Disease, Influenza Vaccination, Machine Learning Techniques, Gradient Boosting, Support Vector, Age Group, Learning Models, Extreme Gradient Boosting, Machine Learning Models

Background: Annual influenza vaccination is an important public health measure to prevent influenza infections and is strongly recommended for cardiovascular disease (CVD) patients, especially in the current coronavirus disease 2019 (COVID-19) pandemic. The aim of this study is to develop a machine learning model to identify Korean adult CVD patients with low adherence to influenza vaccination. Methods: Adults with CVD (n = 815) from a nationally representative dataset of the Fifth Korea National Health and Nutrition Examination Survey (KNHANES V) were analyzed. Among these adults, 500 (61.4%) had answered "yes" to whether they had received seasonal influenza vaccinations in the past 12 months. The classification process was performed using the logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB) machine learning techniques. Because the Ministry of Health and Welfare in Korea offers free influenza immunization for the elderly, separate models were developed for the < 65 and ≥ 65 age groups. Results: The accuracy of machine learning models using 16 variables as predictors of low influenza vaccination adherence was compared; for the ≥ 65 age group, XGB (84.7%) and RF (84.7%) had the best accuracies, followed by LR (82.7%) and SVM (77.6%). For the < 65 age group, SVM had the best accuracy (68.4%), followed by RF (64.9%), LR (63.2%), and XGB (61.4%). Conclusions: The machine learning models show comparable performance in classifying adult CVD patients with low adherence to influenza vaccination.

Predictive models for stage and risk classification in head and neck squamous cell carcinoma (HNSCC)

PeerJ, doi:10.7717/peerj.9656, 2020, Vol 8, pp. e9656

Author(s): Sugandh Kumar, Srinivas Patnaik, Anshuman Dixit

Keyword(s): Machine Learning, Expression Profiles, Disease Process, Penalized Regression, Functional Enrichment, Machine Learning Techniques, Gradient Boosting, Support Vector, Sequencing Data, Therapeutic Modalities

Machine learning techniques are increasingly used in the analysis of high-throughput genome sequencing data to better understand the disease process and to design therapeutic modalities. In the current study, we applied state-of-the-art machine learning (ML) algorithms (Random Forest (RF), Support Vector Machine Radial Kernel (svmR), Adaptive Boost (AdaBoost), averaged Neural Network (avNNet), and Gradient Boosting Machine (GBM)) to stratify HNSCC patients into early and late clinical stages (TNM) and to predict risk using miRNA expression profiles. A six-miRNA signature was identified that can stratify patients into the early and late stages. The mean accuracy, sensitivity, specificity, and area under the curve (AUC) were found to be 0.84, 0.87, 0.78, and 0.82, respectively, indicating the robust performance of the generated model. A prognostic signature of eight miRNAs was identified using LASSO (least absolute shrinkage and selection operator) penalized regression. These miRNAs were found to be significantly associated with the overall survival of the patients. The pathway and functional enrichment analysis of the identified biomarkers revealed their involvement in important cancer pathways such as GP6 signalling, Wnt signalling, p53 signalling, and granulocyte adhesion and diapedesis. To the best of our knowledge, this is the first such study, and we hope that these signature miRNAs will be useful for the risk stratification of patients and the design of therapeutic modalities.

Interpretable Machine Learning for Early Neurological Deterioration Prediction in Atrial Fibrillation-Related Stroke

doi:10.21203/rs.3.rs-446890/v1, 2021

Author(s): Seong Hwan Kim, Eun-Tae Jeon, Sungwook Yu, Kyungmi O, Chi Kyung Kim, ...

Keyword(s): Machine Learning, Atrial Fibrillation, Neurological Deterioration, Gradient Boosting, Support Vector, Light Gradient, Interpretable Machine Learning, Extreme Gradient Boosting, Early Neurological Deterioration, Feature Importance

We aimed to develop a novel prediction model for early neurological deterioration (END) based on an interpretable machine learning (ML) algorithm for atrial fibrillation (AF)-related stroke and to evaluate the prediction accuracy and feature importance of ML models. Data from multi-center prospective stroke registries in South Korea were collected. After stepwise data preprocessing, we utilized logistic regression, support vector machine, extreme gradient boosting, light gradient boosting machine (LightGBM), and multilayer perceptron models. We used the Shapley additive explanations (SHAP) method to evaluate feature importance. Of the 3,623 stroke patients, the 2,363 who had arrived at the hospital within 24 hours of symptom onset and had available information regarding END were included. Of these, 318 (13.5%) had END. The LightGBM model showed the highest area under the receiver operating characteristic curve (0.778; 95% CI, 0.726–0.830). The feature importance analysis revealed that fasting glucose level and the National Institutes of Health Stroke Scale score were the most influential factors. Among ML algorithms, the LightGBM model was particularly useful for predicting END, as it revealed new and diverse predictors. Additionally, the SHAP method can be adjusted to individualize the features’ effects on the predictive power of the model.
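The area under the receiver operating characteristic curve reported here can be read as the probability that a randomly chosen patient with END receives a higher predicted score than a randomly chosen patient without it. A minimal sketch of that pairwise definition follows; the labels and scores are hypothetical, the function name is illustrative, and this is not the authors' evaluation code.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = END) and model scores, for illustration only.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.2, 0.8]

value = auroc(labels, scores)
print(round(value, 3))
```

The double loop is quadratic in the number of patients, which is fine for a sketch; rank-based formulas give the same value more efficiently on large cohorts.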

Hybrid approach with Deep Auto-Encoder and optimized LSTM based Deep Learning approach to detect anomaly in cloud logs

Journal of Intelligent & Fuzzy Systems, doi:10.3233/jifs-201707, 2021, pp. 1-15

Author(s): Savaridassan Pankajashan, G. Maragatham, T. Kirthiga Devi

Keyword(s): Machine Learning, Deep Learning, Anomaly Detection, Performance Metrics, Hybrid Approach, Machine Learning Techniques, Support Vector, Paper Machine, Log Data, Isolation Forest

Anomaly-based detection is concerned with recognizing the uncommon: catching unusual activity and finding the action behind it. It has a wide range of critical applications, from banking application security to natural sciences, medical systems, and marketing apps. Anomaly-based detection built on Machine Learning techniques is a form of artificial intelligence. With the ever-expanding volume and new kinds of data, for example sensor data from an enormous number of IoT devices and network flow data from cloud computing, it is unsurprising that there is growing interest in making more decisions automatically by means of AI and ML applications. With respect to anomaly detection, however, many applications of the scheme are driven simply by the passion for detection. In this paper, the Machine Learning (ML) techniques SVM and Isolation Forest are tested as classifiers and, among Deep Learning (DL) techniques, the proposed DA-LSTM (Deep Auto-Encoder LSTM) model is adopted for preprocessing of log data and anomaly-based detection to obtain better detection performance. An enhanced LSTM (long short-term memory) model, with suitable parameters optimized using a genetic algorithm (GA), is used to better recognize anomalies in log data that has been filtered by a Deep Auto-Encoder (DA). The deep neural network models are used to convert unstructured log data into training-ready features suitable for log classification in anomaly detection. The models are assessed on two benchmark datasets, the OpenStack logs and the CIDDS-001 intrusion detection OpenStack server dataset. The results show that the DA-LSTM model performs better than other notable ML techniques. We further investigated the performance of the ML and DL models through well-known indicators, specifically the F-measure, Accuracy, Recall, and Precision. The experiments show that the Isolation Forest and Support Vector Machine classifiers reach roughly 81% and 79% accuracy, respectively, on the CIDDS-001 OpenStack server dataset, while the proposed DA-LSTM classifier reaches around 99.1%, an improvement over the familiar ML algorithms. Further, the DA-LSTM results on the OpenStack log datasets show better anomaly detection than other notable machine learning models.

Using Random Forests on Real-World City Data for Urban Planning in a Visual Semantic Decision Support System

Sensors, doi:10.3390/s19102266, 2019, Vol 19 (10), pp. 2266, Cited By ~ 1

Author(s): Nikolaos Sideris, Georgios Bardis, Athanasios Voulodimos, Georgios Miaoulis, Djamchid Ghazanfarpour

Keyword(s): Machine Learning, Urban Planning, Random Forests, Real World, Performance Metrics, World City, Supervised Machine Learning, Machine Learning Techniques, Support Vector, Real World Data

The constantly increasing amount and availability of urban data derived from varying sources lead to an assortment of challenges that include, among others, the consolidation, visualization, and maximal exploitation of these data. A preeminent problem affecting urban planning is the appropriate choice of location to host a particular activity (either a commercial or a common welfare service) or the correct use of an existing building or empty space. In this paper, we propose an approach that addresses these challenges with the aid of machine learning techniques. The proposed system combines, fuses, and merges various types of data from different sources, encodes them using a novel semantic model that can capture and utilize both low-level geometric information and higher-level semantic information, and subsequently feeds them to the random forests classifier, as well as to other supervised machine learning models for comparison. Our experimental evaluation on multiple real-world data sets, comparing the performance of several classifiers (including Feedforward Neural Networks, Support Vector Machines, Bag of Decision Trees, k-Nearest Neighbors, and Naïve Bayes), indicated the superiority of Random Forests in terms of the examined performance metrics (Accuracy, Specificity, Precision, Recall, F-measure, and G-mean).

Machine learning versus logistic regression methods for 2-year mortality prognostication in a small, heterogeneous glioma database

doi:10.1101/472555, 2018, Cited By ~ 2

Author(s): Sandip S Panesar, Rhett N D’Souza, Fang-Cheng Yeh, Juan C Fernandez-Miranda

Keyword(s): Machine Learning, Logistic Regression, Machine Learning Techniques, World Health, Support Vector, Molecular Characteristics, Regression Methods, Learning Techniques, The World, Health Organization

Background: Machine learning (ML) is the application of specialized algorithms to datasets for trend delineation, categorization, or prediction. ML techniques have been traditionally applied to large, high-dimensional databases. Gliomas are a heterogeneous group of primary brain tumors, traditionally graded using histopathological features. Recently the World Health Organization proposed a novel grading system for gliomas incorporating molecular characteristics. We aimed to study whether ML could achieve accurate prognostication of 2-year mortality in a small, high-dimensional database of glioma patients. Methods: We applied three machine learning techniques, artificial neural networks (ANN), decision trees (DT), and support vector machine (SVM), as well as classical logistic regression (LR), to a dataset consisting of 76 glioma patients of all grades. We compared the effect of applying the algorithms to the raw database versus a database where only statistically significant features were included in the algorithmic inputs (feature selection). Results: Raw input consisted of 21 variables and achieved performance (accuracy/AUC) of 70.7%/0.70 for ANN, 68%/0.72 for SVM, 66.7%/0.64 for LR, and 65%/0.70 for DT. Feature-selected input consisted of 14 variables and achieved performance of 73.4%/0.75 for ANN, 73.3%/0.74 for SVM, 69.3%/0.73 for LR, and 65.2%/0.63 for DT. Conclusions: We demonstrate that these techniques can also be applied to small, yet high-dimensional datasets. Our ML techniques achieved reasonable performance compared to similar studies in the literature. Though local databases may be small versus larger cancer repositories, we demonstrate that ML techniques can still be applied to their analysis, though traditional statistical methods are of similar benefit.

A Comparative Analysis of Enhanced Machine Learning Algorithms for Smart Grid Stability Prediction

doi:10.36227/techrxiv.16863145.v1, 2021

Author(s): Ankit Ghosh, Alok Kole

Keyword(s): Machine Learning, Comparative Analysis, Smart Grid, Smart Grids, Machine Learning Algorithms, Electricity Sector, Stochastic Gradient Descent, Energy Output, Gradient Boosting, Support Vector

Smart grid is an essential concept in the transformation of the electricity sector into an intelligent, digitalized energy network that can deliver optimal energy from the source to the consumers. Smart grids, being self-sufficient systems, are constructed through the integration of information, telecommunication, and advanced power technologies with the existing electricity systems. Artificial Intelligence (AI) is an important technology driver in smart grids. The application of AI techniques in smart grids is becoming more apparent because traditional modelling, optimization, and control techniques have their own limitations. Machine Learning (ML), a subset of AI, enables intelligent decision-making and response to sudden changes in customer energy demands, unexpected disruption of power supply, sudden variations in renewable energy output, or any other catastrophic events in a smart grid. This paper presents a comparison among some state-of-the-art ML algorithms for predicting smart grid stability. The selected dataset contains results from simulations of smart grid stability. Enhanced ML algorithms such as Support Vector Machine (SVM), Logistic Regression, K-Nearest Neighbour (KNN), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), Stochastic Gradient Descent (SGD) classifier, XGBoost, and Gradient Boosting classifiers have been implemented to forecast smart grid stability. A comparative analysis of the different ML models has been performed based on the following evaluation metrics: accuracy, precision, recall, F1-score, and the AUC-ROC and AUC-PR curves. The test results are quite promising, with the XGBoost classifier outperforming all the other models with an accuracy of 97.5%, recall of 98.4%, precision of 97.6%, F1-score of 97.9%, AUC-ROC of 99.8%, and AUC-PR of 99.9%.
