Predicting Forest Fires using Supervised and Ensemble Machine Learning Algorithms

Forest fires have become one of the most frequently occurring disasters in recent years. The effects of forest fires have a lasting impact on the environment as it lead to deforestation and global warming, which is also one of its major cause of occurrence. Forest fires are dealt by collecting the satellite images of forest and if there is any emergency caused by the fires then the authorities are notified to mitigate its effects. By the time the authorities get to know about it, the fires would have already caused a lot of damage. Data mining and machine learning techniques can provide an efficient prevention approach where data associated with forests can be used for predicting the eventuality of forest fires. This paper uses the dataset present in the UCI machine learning repository which consists of physical factors and climatic conditions of the Montesinho park situated in Portugal. Various algorithms like Logistic regression, Support Vector Machine, Random forest, K-Nearest neighbors in addition to Bagging and Boosting predictors are used, both with and without Principal Component Analysis (PCA). Among the models in which PCA was applied, Logistic Regression gave the highest F-1 score of 68.26 and among the models where PCA was absent, Gradient boosting gave the highest score of 68.36.

Download Full-text

Evaluating Variable Selection and Machine Learning Algorithms for Estimating Forest Heights by Combining Lidar and Hyperspectral Data

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9090507 ◽

2020 ◽

Vol 9 (9) ◽

pp. 507

Author(s):

Sanjiwana Arjasakusuma ◽

Sandiaga Swahyu Kusuma ◽

Stuart Phinn

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Algorithms ◽

Principal Component ◽

Hyperspectral Data ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Forest Height ◽

Extreme Gradient Boosting

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving high- spatial and spectral dimensionality data, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA) in combination with machine learning algorithms such as multivariate adaptive regression spline (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with trees (XGbtree and XGBdart) and linear (XGBlin) classifiers were evaluated. The results demonstrated that the combinations of BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R2 = 0.53 and RMSE = 1.7 m (18.4% of nRMSE and 0.046 m of bias) for BO-XGBdart and R2 = 0.51 and RMSE = 1.8 m (15.8% of nRMSE and −0.244 m of bias) for BO-SVR. Our study also demonstrated the effectiveness of BO for variables selection; it could reduce 95% of the data to select the 29 most important variables from the initial 516 variables from lidar metrics and hyperspectral data.

Download Full-text

Predicting CoVID-19 community mortality risk using machine learning and development of an online prognostic tool

PeerJ ◽

10.7717/peerj.10083 ◽

2020 ◽

Vol 8 ◽

pp. e10083 ◽

Cited By ~ 1

Author(s):

Ashis Kumar Das ◽

Shiba Mishra ◽

Saji Saraswathy Gopalan

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Open Source ◽

Mortality Risk ◽

Machine Learning Algorithms ◽

Brier Score ◽

Gradient Boosting ◽

Support Vector ◽

Prediction Tool ◽

Online Prediction

Background The recent pandemic of CoVID-19 has emerged as a threat to global health security. There are very few prognostic models on CoVID-19 using machine learning. Objectives To predict mortality among confirmed CoVID-19 patients in South Korea using machine learning and deploy the best performing algorithm as an open-source online prediction tool for decision-making. Materials and Methods Mortality for confirmed CoVID-19 patients (n = 3,524) between January 20, 2020 and May 30, 2020 was predicted using five machine learning algorithms (logistic regression, support vector machine, K nearest neighbor, random forest and gradient boosting). The performance of the algorithms was compared, and the best performing algorithm was deployed as an online prediction tool. Results The logistic regression algorithm was the best performer in terms of discrimination (area under ROC curve = 0.830), calibration (Matthews Correlation Coefficient = 0.433; Brier Score = 0.036) and. The best performing algorithm (logistic regression) was deployed as the online CoVID-19 Community Mortality Risk Prediction tool named CoCoMoRP (https://ashis-das.shinyapps.io/CoCoMoRP/). Conclusions We describe the development and deployment of an open-source machine learning tool to predict mortality risk among CoVID-19 confirmed patients using publicly available surveillance data. This tool can be utilized by potential stakeholders such as health providers and policymakers to triage patients at the community level in addition to other approaches.

Download Full-text

Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets

Briefings in Bioinformatics ◽

10.1093/bib/bbaa321 ◽

2020 ◽

Author(s):

Zhenxing Wu ◽

Minfeng Zhu ◽

Yu Kang ◽

Elaine Lai-Han Leung ◽

Tailong Lei ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Support Vector Machine ◽

Gaussian Process Regression ◽

Principal Component ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Data Sets ◽

Linear Svm

Abstract Although a wide variety of machine learning (ML) algorithms have been utilized to learn quantitative structure–activity relationships (QSARs), there is no agreed single best algorithm for QSAR learning. Therefore, a comprehensive understanding of the performance characteristics of popular ML algorithms used in QSAR learning is highly desirable. In this study, five linear algorithms [linear function Gaussian process regression (linear-GPR), linear function support vector machine (linear-SVM), partial least squares regression (PLSR), multiple linear regression (MLR) and principal component regression (PCR)], three analogizers [radial basis function support vector machine (rbf-SVM), K-nearest neighbor (KNN) and radial basis function Gaussian process regression (rbf-GPR)], six symbolists [extreme gradient boosting (XGBoost), Cubist, random forest (RF), multiple adaptive regression splines (MARS), gradient boosting machine (GBM), and classification and regression tree (CART)] and two connectionists [principal component analysis artificial neural network (pca-ANN) and deep neural network (DNN)] were employed to learn the regression-based QSAR models for 14 public data sets comprising nine physicochemical properties and five toxicity endpoints. The results show that rbf-SVM, rbf-GPR, XGBoost and DNN generally illustrate better performances than the other algorithms. The overall performances of different algorithms can be ranked from the best to the worst as follows: rbf-SVM > XGBoost > rbf-GPR > Cubist > GBM > DNN > RF > pca-ANN > MARS > linear-GPR ≈ KNN > linear-SVM ≈ PLSR > CART ≈ PCR ≈ MLR. In terms of prediction accuracy and computational efficiency, SVM and XGBoost are recommended to the regression learning for small data sets, and XGBoost is an excellent choice for large data sets. We then investigated the performances of the ensemble models by integrating the predictions of multiple ML algorithms. The results illustrate that the ensembles of two or three algorithms in different categories can indeed improve the predictions of the best individual ML algorithms.

Download Full-text

A REVIEW ON MACHINE LEARNING TECHNIQUES FOR ADVANCED HEALTH CARE SYSTEMS

June-2020 - International Journal of Engineering Sciences & Research Technology ◽

10.29121/ijesrt.v9.i11.2020.1 ◽

2020 ◽

Vol 9 (11) ◽

pp. 1-7

Keyword(s):

Machine Learning ◽

Health Care ◽

Logistic Regression ◽

Nearest Neighbor ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Support Vector ◽

K Nearest Neighbor

Artificial intelligence is the technology that lets a machine mimic the thinking ability of a human being. Machine learning is the subset of AI, that makes this machine exhibit human behavior by making it learn from the known data, without the need of explicitly programming it. The health care sector has adopted this technology, for the development of medical procedures, maintaining huge patient’s records, assist physicians in the prediction, detection, and treatment of diseases and many more. In this paper, a comparative study of six supervised machine learning algorithms namely Logistic Regression(LR),support vector machine(SVM),Decision Tree(DT).Random Forest(RF),k-nearest neighbor(k-NN),Naive Bayes (NB) are made for the classification and prediction of diseases. Result shows out of compared supervised learning algorithms here, logistic regression is performing best with an accuracy of 81.4 % and the least performing is k-NN with just an accuracy of 69.01% in the classification and prediction of diseases.

Download Full-text

Diabetes Mellitus Prediction using Ensemble Machine Learning Techniques

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b3480.079220 ◽

2020 ◽

Vol 9 (2) ◽

pp. 312-316

Keyword(s):

Machine Learning ◽

Clustering Algorithm ◽

Principal Component ◽

Ensemble Classifier ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Support Vector ◽

Healthcare Industry ◽

Blurred Vision ◽

The Way

The healthcare industry is inflicted with the plethora of patient data which is being supplemented each day manifold. Researchers have been continually using this data to help the healthcare industry improve upon the way major diseases could be handled. They are even working upon the way the patients could be informed timely of the symptoms that could avoid the major hazards related to them. Diabetes is one such disease that is growing at an alarming rate today. In fact, it can inflict numerous severe damages; blurred vision, myopia, burning extremities, kidney and heart failure. It occurs when sugar levels reach a certain threshold, or the human body cannot contain enough insulin to regulate the threshold. Therefore, patients affected by Diabetes must be informed so that proper treatments can be taken to control Diabetes. For this reason, early prediction and classification of Diabetes are significant. This work makes use of Machine Learning algorithms to improve the accuracy of prediction of the Diabetes. A dataset obtained as an output of K-Mean Clustering Algorithm was fed to an ensemble model with principal component analysis and K-means clustering. Our ensemble method produced only eight incorrectly classified instances, which was lowest compared to other methods. The experiments also showed that ensemble classifier models performed better than the base classifiers alone. Its result was compared with the same Dataset being applied on specific methods like random forest, Support Vector Machine, Decision Tree, Multilayer perceptron, and Naïve Bayes classification methods. All methods were run using 10k fold cross-validation.

Download Full-text

Classification of Lidar Measurements Using Supervised and Unsupervised Machine Learning Methods

10.5194/amt-2019-495 ◽

2020 ◽

Author(s):

Ghazal Farhani ◽

Robert J. Sica ◽

Mark Joseph Daley

Keyword(s):

Machine Learning ◽

Ad Hoc ◽

Signal To Noise Ratio ◽

Stratospheric Aerosol ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Lidar Measurements

Abstract. While it is relatively straightforward to automate the processing of lidar signals, it is more difficult to choose periods of "good" measurements to process. Groups use various ad hoc procedures involving either very simple (e.g. signal-to-noise ratio) or more complex procedures (e.g. Wing et al., 2018) to perform a task which is easy to train humans to perform but is time consuming. Here, we use machine learning techniques to train the machine to sort the measurements before processing. The presented methods is generic and can be applied to most lidars. We test the techniques using measurements from the Purple Crow Lidar (PCL) system located in London, Canada. The PCL has over 200,000 raw scans in Rayleigh and Raman channels available for classification. We classify raw (level-0) lidar measurements as "clear" sky scans with strong lidar returns, "bad" scans, and scans which are significantly influenced by clouds or aerosol loads. We examined different supervised machine learning algorithms including the random forest, the support vector machine, and the gradient boosting trees, all of which can successfully classify scans. The algorithms where trained using about 1500 scans for each PCL channel, selected randomly from different nights of measurements in different years. The success rate of identification, for all the channels is above 95 %. We also used the t-distributed Stochastic Embedding (t-SNE) method, which is an unsupervised algorithm, to cluster our lidar scans. Because the t-SNE is a data driven method in which no labelling of training set is needed, it is an attractive algorithm to find anomalies in lidar scans. The method has been tested on several nights of measurements from the PCL measurements.The t-SNE can successfully cluster the PCL data scans into meaningful categories. To demonstrate the use of the technique, we have used the algorithm to identify stratospheric aerosol layers due to wildfires.

Download Full-text

A Comparative Analysis of Machine Learning Techniques for Cyberbullying Detection on Twitter

Future Internet ◽

10.3390/fi12110187 ◽

2020 ◽

Vol 12 (11) ◽

pp. 187 ◽

Cited By ~ 1

Author(s):

Amgad Muneer ◽

Suliman Mohamed Fati

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Performance Metrics ◽

Machine Learning Techniques ◽

Stochastic Gradient Descent ◽

Gradient Boosting ◽

Support Vector ◽

Light Gradient ◽

Global Issue ◽

Cyberbullying Detection

The advent of social media, particularly Twitter, raises many issues due to a misunderstanding regarding the concept of freedom of speech. One of these issues is cyberbullying, which is a critical global issue that affects both individual victims and societies. Many attempts have been introduced in the literature to intervene in, prevent, or mitigate cyberbullying; however, because these attempts rely on the victims’ interactions, they are not practical. Therefore, detection of cyberbullying without the involvement of the victims is necessary. In this study, we attempted to explore this issue by compiling a global dataset of 37,373 unique tweets from Twitter. Moreover, seven machine learning classifiers were used, namely, Logistic Regression (LR), Light Gradient Boosting Machine (LGBM), Stochastic Gradient Descent (SGD), Random Forest (RF), AdaBoost (ADB), Naive Bayes (NB), and Support Vector Machine (SVM). Each of these algorithms was evaluated using accuracy, precision, recall, and F1 score as the performance metrics to determine the classifiers’ recognition rates applied to the global dataset. The experimental results show the superiority of LR, which achieved a median accuracy of around 90.57%. Among the classifiers, logistic regression achieved the best F1 score (0.928), SGD achieved the best precision (0.968), and SVM achieved the best recall (1.00).

Download Full-text

Prediction of Patient Readmission via Machine Learning Algorithms

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f7770.038620 ◽

2020 ◽

Vol 8 (6) ◽

pp. 3226-3232

Keyword(s):

Machine Learning ◽

Hospital Readmission ◽

Machine Learning Algorithms ◽

High Rate ◽

Machine Learning Techniques ◽

Healthcare Sector ◽

Gradient Boosting ◽

Support Vector ◽

Linear Discriminant

Predicting the probability of hospital readmission is one of the most vital issues and is considered to be an important research area in the healthcare sector. For curing any of the diseases that might arise, there shall be some essential resources such as medical staff, expertise, beds and rooms. This secures getting excellent medical service. For example, heart failure (HF) or diabetes is a syndrome that could reduce the living quality of patients and has a serious influence on systems of healthcare. The previously mentioned diseases can result in high rate of readmission and hence high rate of costs as well. In this case, algorithms of machine learning are utilized to curb readmissions levels and improve the life quality of patients. Unluckily, a comparatively few numbers of researches in the literature endeavored to address this issue while a large proportion of researches were interested in predicting the probability of detecting diseases. Despite there is a plainly visible shortage on this topic, this paper seeks to spot most of the studies related to predict the probability of hospital readmission by the usage of machine learning techniques such as such as Logistic Regression (LR), Support Vector Machine (SVM), Artificial Neural Networks (ANNs), Linear Discriminant Analysis (LDA), Bayes algorithm, Random Forest (RF), Decision Trees (DTs), AdaBoost and Gradient Boosting (GB). Specifically, we explore the different techniques used in a medical area under the machine learning research field. In addition, we define four features that are used as criteria for an effective comparison among the employed techniques. These features include goal, data size, method, and performance. Furthermore, some recommendations are drawn from the comparison which is related to the selection of the best techniques in the medical field. Based on the outcomes of this research, it was found out that (bagging and DT) is the best technique to predict diabetes, whereas SVM is the best technique when it comes to prediction the breast cancer, and hospital readmission.

Download Full-text

Classification of lidar measurements using supervised and unsupervised machine learning methods

Atmospheric Measurement Techniques ◽

10.5194/amt-14-391-2021 ◽

2021 ◽

Vol 14 (1) ◽

pp. 391-402

Author(s):

Ghazal Farhani ◽

Robert J. Sica ◽

Mark Joseph Daley

Keyword(s):

Machine Learning ◽

Ad Hoc ◽

Signal To Noise Ratio ◽

Stratospheric Aerosol ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Lidar Measurements

Abstract. While it is relatively straightforward to automate the processing of lidar signals, it is more difficult to choose periods of “good” measurements to process. Groups use various ad hoc procedures involving either very simple (e.g. signal-to-noise ratio) or more complex procedures (e.g. Wing et al., 2018) to perform a task that is easy to train humans to perform but is time-consuming. Here, we use machine learning techniques to train the machine to sort the measurements before processing. The presented method is generic and can be applied to most lidars. We test the techniques using measurements from the Purple Crow Lidar (PCL) system located in London, Canada. The PCL has over 200 000 raw profiles in Rayleigh and Raman channels available for classification. We classify raw (level-0) lidar measurements as “clear” sky profiles with strong lidar returns, “bad” profiles, and profiles which are significantly influenced by clouds or aerosol loads. We examined different supervised machine learning algorithms including the random forest, the support vector machine, and the gradient boosting trees, all of which can successfully classify profiles. The algorithms were trained using about 1500 profiles for each PCL channel, selected randomly from different nights of measurements in different years. The success rate of identification for all the channels is above 95 %. We also used the t-distributed stochastic embedding (t-SNE) method, which is an unsupervised algorithm, to cluster our lidar profiles. Because the t-SNE is a data-driven method in which no labelling of the training set is needed, it is an attractive algorithm to find anomalies in lidar profiles. The method has been tested on several nights of measurements from the PCL measurements. The t-SNE can successfully cluster the PCL data profiles into meaningful categories. To demonstrate the use of the technique, we have used the algorithm to identify stratospheric aerosol layers due to wildfires.

Download Full-text

Prediction of addiction to drugs and alcohol using machine learning: A case study on Bangladeshi population

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v11i5.pp4471-4480 ◽

2021 ◽

Vol 11 (5) ◽

pp. 4471

Author(s):

Md. Ariful Islam Arif ◽

Saiful Islam Sany ◽

Farah Sharmin ◽

Md. Sadekur Rahman ◽

Md. Tarek Habib

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Performance Metrics ◽

Learning Algorithms ◽

Principal Component ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Data Set ◽

Adaptive Boosting ◽

Drugs And Alcohol

Nowadays addiction to drugs and alcohol has become a significant threat to the youth of the society as Bangladesh’s population. So, being a conscientious member of society, we must go ahead to prevent these young minds from life-threatening addiction. In this paper, we approach a machinelearning-based way to forecast the risk of becoming addicted to drugs using machine-learning algorithms. First, we find some significant factors for addiction by talking to doctors, drug-addicted people, and read relevant articles and write-ups. Then we collect data from both addicted and nonaddicted people. After preprocessing the data set, we apply nine conspicuous machine learning algorithms, namely k-nearest neighbors, logistic regression, SVM, naïve bayes, classification, and regression trees, random forest, multilayer perception, adaptive boosting, and gradient boosting machine on our processed data set and measure the performances of each of these classifiers in terms of some prominent performance metrics. Logistic regression is found outperforming all other classifiers in terms of all metrics used by attaining an accuracy approaching 97.91%. On the contrary, CART shows poor results of an accuracy approaching 59.37% after applying principal component analysis.

Download Full-text