Prediction of West Nile Virus using Ensemble Classifiers

West Nile Virus (WNV) is a mosquito-borne disease that infects human beings through the bite of an infected mosquito. The disease is considered a serious threat to society, especially in the United States, where it is frequently found in localities that have water bodies. The traditional approach is to collect mosquito traps from a locality and check whether the mosquitoes are infected with the virus. If the virus is found, the locality is sprayed with pesticides. This process is time-consuming and expensive. Machine learning methods can provide an efficient way to predict the presence of the virus in a locality using location and weather data. This paper uses a dataset available on Kaggle that includes information about the traps found in a locality and about the locality’s weather. The dataset is imbalanced, so the Synthetic Minority Over-sampling Technique (SMOTE), an upsampling method, is used to balance it. Ensemble learning classifiers such as random forest, gradient boosting and Extreme Gradient Boosting (XGB) are then applied. The performance of the ensemble classifiers is compared with that of the best-performing supervised learning algorithm, SVM. Among the models, XGB gave the highest F1 score of 92.93, performing marginally better than random forest (92.78) and SVM (91.16).
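
The SMOTE step mentioned in this abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' code: the function name `smote_sketch`, the toy rows, and the choice of `k=3` are assumptions made for illustration. The idea shown is the core of SMOTE: each synthetic minority sample is placed at a random position on the segment between a minority point and one of its nearest minority neighbours.

```python
import random

def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy data (made up for illustration): 4 minority rows with 2 features each.
minority_rows = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_rows = smote_sketch(minority_rows, n_synthetic=6)
print(len(new_rows))
```

In practice a maintained implementation would normally be used instead of a hand-written one, and oversampling would be applied to the training split only so that synthetic rows do not leak into the evaluation data.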

Modeling and analysis of COVID-19 new deaths using tree-based ensemble

10.36227/techrxiv.16566012.v1 ◽

2021 ◽

Author(s):

Ibrahim Abaker Targio Hashem ◽

Raja Sher Afgun Usmani ◽

Asad Ali Shah ◽

Abdulwahab Ali Almazroi ◽

Muhammad Bilal

Keyword(s):

Infectious Disease ◽

United States ◽

Random Forest ◽

Economic Activity ◽

The United States ◽

Gradient Boosting ◽

Health Crisis ◽

Modeling And Analysis ◽

Extreme Gradient Boosting ◽

The World

The COVID-19 pandemic has emerged as the world's most serious health crisis, affecting millions of people all over the world. The majority of nations have imposed nationwide curfews and reduced economic activity to combat the spread of this infectious disease. Governments are monitoring the situation and making critical decisions based on the daily number of new cases and deaths reported. Therefore, this study aims to predict the daily new deaths using four tree-based ensemble models, i.e., Gradient Tree Boosting (GB), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Voting Regressor (VR), for the three most affected countries, which are the United States, Brazil, and India. The results showed that VR outperformed other models in predicting daily new deaths for all three countries. The predictions of daily new deaths made using VR for Brazil and India are very close to the actual new deaths, whereas the prediction of daily new deaths for the United States still needs to be improved.
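
As a rough illustration of what a voting regressor does, the sketch below averages the predictions of several base models. It assumes unweighted averaging, which is how voting regressors are commonly defined; the abstract does not state the weighting used. The three lambda functions are hypothetical stand-ins for fitted GB, RF and XGBoost models, and the numbers are invented.

```python
from statistics import mean

def voting_predict(models, x):
    """Average the predictions of several fitted regressors for input x."""
    return mean(model(x) for model in models)

# Hypothetical stand-ins for fitted GB, RF and XGBoost regressors.
# Each maps a day index to a predicted number of daily new deaths.
base_models = [
    lambda day: 100.0 + 2.0 * day,
    lambda day: 110.0 + 1.5 * day,
    lambda day: 90.0 + 2.5 * day,
]

prediction = voting_predict(base_models, 10)
print(prediction)
```

Averaging tends to cancel errors that the base models do not share, which is one plausible reason an ensemble of this kind can outperform its individual members.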

Towards Optimization of Malware Detection using Extra-Tree and Random Forest Feature Selections on Ensemble Classifiers

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f5545.039621 ◽

2021 ◽

Vol 9 (6) ◽

pp. 223-232

Author(s):

Fadare Oluwaseun Gbenga ◽

Adetunmbi Adebayo Olusola ◽

Oyinloye Oghenerukevwe Elohor

Keyword(s):

Feature Selection ◽

Random Forest ◽

Communication Systems ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Detection Accuracy ◽

Ensemble Classifiers ◽

Ensemble Techniques ◽

Study Results ◽

Extreme Gradient Boosting

The proliferation of malware on computer communication systems poses great security challenges to confidential stored data and other valuable assets across the globe. There have been several attempts to curb the menace using signature-based approaches, and in recent times machine learning techniques have been extensively explored. This paper proposes a framework that combines feature selection based on extra-tree and random forest with eight ensemble techniques on five base learners: KNN, Naive Bayes, SVM, Decision Trees, and Logistic Regression. Among the base learners, K-Nearest Neighbors returns the highest accuracy of 96.48%, 96.40%, and 87.89% on extra-tree, random forest, and without feature selection (WFS), respectively. The random forest ensemble has the highest accuracy on both feature selections, with 98.50% and 98.16% on random forest and extra-tree, respectively. The Extreme Gradient Boosting classifier is next on random forest FS with an accuracy of 98.37%, while Voting returns the lowest detection accuracy of 95.80%. On extra-tree FS, Bagging is next with a detection accuracy of 98.09%, while Voting returns the lowest accuracy of 95.54%. Random Forest scores highest on all seven evaluation measures under both the extra-tree and random forest feature selection techniques. The study results show that tree-based ensemble models are proficient and successful for malware classification.

Coastal Wetland Mapping Using Ensemble Learning Algorithms: A Comparative Study of Bagging, Boosting and Stacking Techniques

Remote Sensing ◽

10.3390/rs12101683 ◽

2020 ◽

Vol 12 (10) ◽

pp. 1683

Author(s):

Li Wen ◽

Michael Hughes

Keyword(s):

Remote Sensing ◽

Random Forest ◽

Coastal Wetland ◽

River Estuary ◽

Gradient Boosting ◽

Wetland Mapping ◽

Ensemble Classifiers ◽

Coastal Landscape ◽

Extreme Gradient Boosting ◽

Wetland Distribution

Coastal wetlands are a critical component of the coastal landscape that are increasingly threatened by sea level rise and other human disturbance. Periodically mapping wetland distribution is crucial to coastal ecosystem management. Ensemble learning (EL) algorithms, such as random forest (RF) and gradient boosting machine (GBM), are now commonly applied in the field of remote sensing. However, the performance and potential of other EL methods, such as extreme gradient boosting (XGBoost) and bagged trees, are rarely compared and tested for coastal wetland mapping. In this study, we applied the three most widely used EL techniques (i.e., bagging, boosting and stacking) to map wetland distribution in a highly modified coastal catchment, the Manning River Estuary, Australia. Our results demonstrated the advantages of using ensemble classifiers to accurately map wetland types in a coastal landscape. Enhanced bagging decision trees, i.e., classifiers with additional methods to increase ensemble diversity such as RF and weighted subspace random forest, had comparably high predictive power. For the stacking method evaluated in this study, our results are inconclusive, and further comprehensive quantitative study is encouraged. Our findings also suggested that the ensemble methods were less effective at discriminating minority classes in comparison with more common classes. Finally, the variable importance results indicated that hydro-geomorphic factors, such as tidal depth and distance to water edge, were among the most influential variables across the top classifiers. However, vegetation indices derived from longer time series of remote sensing data that capture the full features of land phenology are likely to improve wetland type separation in coastal areas.

On-Ground Distributed COVID-19 Variant Intelligent Data Analytics for a Regional Territory

Wireless Communications and Mobile Computing ◽

10.1155/2021/1679835 ◽

2021 ◽

Vol 2021 ◽

pp. 1-19

Author(s):

Umrah Zadi Khuhawar ◽

Isma Farah Siddiqui ◽

Qasim Ali Arain ◽

Mokhi Maan Siddiqui ◽

Nawab Muhammad Faseeh Qureshi

Keyword(s):

Random Forest ◽

Linear Regression ◽

Supervised Learning ◽

Prediction Models ◽

Learning Algorithm ◽

High Accuracy ◽

Estimation Methods ◽

Gradient Boosting ◽

Extreme Gradient Boosting ◽

Short Term Prediction

The onset of the COVID-19 pandemic and the subsequent transmission among communities has made the entire human population extremely vulnerable. Due to the virus’s contagiousness, the most powerful economies in the world are struggling with inadequate resources. As the number of cases continues to rise and the healthcare industry is overwhelmed by the increasing needs of the infected population, there is a need to estimate the potential future number of cases using prediction methods. This paper leverages data-driven estimation methods such as linear regression (LR), random forest (RF), and the XGBoost (extreme gradient boosting) algorithm. All three algorithms are trained using the COVID-19 data of Pakistan from 24 February to 31 December 2020, wherein the daily resolution is integrated. Essentially, this paper postulates that, with the help of values of new positive cases, medical swabs, daily deaths, and daily new positive cases, it is possible to predict the progression of the COVID-19 pandemic and demonstrate future trends. Linear regression tends to oversimplify concepts in supervised learning and neglect practical challenges present in the real world, which is often cited as its primary disadvantage. In this paper, we use an enhanced random forest algorithm. It is a supervised learning algorithm that is used for classification. This algorithm works well for an extensive range of data items, is very flexible, and possesses very high accuracy. For higher accuracy, we have also implemented the XGBoost algorithm on the dataset. XGBoost is a newly introduced machine learning algorithm; it provides high accuracy of prediction models, and it is observed that it performs well in short-term prediction.
This paper discusses various factors such as total COVID-19 cases, new cases per day, total COVID-19 related deaths, new deaths due to COVID-19, the total number of recoveries, number of daily recoveries, and swabs through the proposed technique. This paper presents an innovative approach that assists health officials in Pakistan with their decision-making processes.

Modeling and analysis of COVID-19 new deaths using tree-based ensemble

10.36227/techrxiv.16566012 ◽

2021 ◽

Author(s):

Ibrahim Abaker Targio Hashem ◽

Raja Sher Afgun Usmani ◽

Asad Ali Shah ◽

Abdulwahab Ali Almazroi ◽

Muhammad Bilal

Keyword(s):

Infectious Disease ◽

United States ◽

Random Forest ◽

Economic Activity ◽

The United States ◽

Gradient Boosting ◽

Health Crisis ◽

Modeling And Analysis ◽

Extreme Gradient Boosting ◽

The World

Evaluation of Three Different Machine Learning Methods for Object-Based Artificial Terrace Mapping—A Case Study of the Loess Plateau, China

Remote Sensing ◽

10.3390/rs13051021 ◽

2021 ◽

Vol 13 (5) ◽

pp. 1021

Author(s):

Hu Ding ◽

Jiaming Na ◽

Shangjing Jiang ◽

Jie Zhu ◽

Kai Liu ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Loess Plateau ◽

Water Conservation ◽

Nearest Neighbor ◽

Gradient Boosting ◽

K Nearest Neighbor ◽

The Loess Plateau ◽

Object Based ◽

Extreme Gradient Boosting

Artificial terraces are of great importance for agricultural production and soil and water conservation. Automatic high-accuracy mapping of artificial terraces is the basis of monitoring and related studies. Previous research achieved artificial terrace mapping based on high-resolution digital elevation models (DEMs) or imagery. As a result of the importance of the contextual information for terrace mapping, object-based image analysis (OBIA) combined with machine learning (ML) technologies are widely used. However, the selection of an appropriate classifier is of great importance for the terrace mapping task. In this study, the performance of an integrated framework using OBIA and ML for terrace mapping was tested. A catchment, Zhifanggou, in the Loess Plateau, China, was used as the study area. First, optimized image segmentation was conducted. Then, features from the DEMs and imagery were extracted, and the correlations between the features were analyzed and ranked for classification. Finally, three different commonly-used ML classifiers, namely, extreme gradient boosting (XGBoost), random forest (RF), and k-nearest neighbor (KNN), were used for terrace mapping. The comparison with the ground truth, as delineated by field survey, indicated that random forest performed best, with a 95.60% overall accuracy (followed by 94.16% and 92.33% for XGBoost and KNN, respectively). The influence of class imbalance and feature selection is discussed. This work provides a credible framework for mapping artificial terraces.

338. Analysis of Etiologies of Aseptic Meningitis within a Nation-Wide Hospital Network

Open Forum Infectious Diseases ◽

10.1093/ofid/ofaa439.533 ◽

2020 ◽

Vol 7 (Supplement_1) ◽

pp. S239-S239

Author(s):

Arunmozhi S Aravagiri ◽

Scott Kubomoto ◽

Ayutyanont Napatkamon ◽

Sarah Wilson ◽

Sudhakar Mallela

Keyword(s):

United States ◽

West Nile Virus ◽

Aseptic Meningitis ◽

Epidemiologic Study ◽

The United States ◽

Positive Test ◽

Diagnostic Tools ◽

West Nile ◽

Varicella Zoster ◽

The Us

Abstract. Background: Aseptic meningitis can be caused by an array of microorganisms, both bacterial and non-bacterial, as well as non-infectious conditions. Some etiologies of aseptic meningitis require treatment with antibiotics, antivirals, antifungals, anti-parasitic agents, immunosuppressants, and/or chemotherapy. There are limited diagnostic tools for diagnosing certain types of aseptic meningitis; therefore, knowing the differential causes of aseptic meningitis and their relative percentages may assist in diagnosis. Review of the literature reveals that there are no recent studies of etiologies of aseptic meningitis in the United States (US). This is an epidemiologic study to delineate etiologies of aseptic meningitis in a large database of 185 HCA hospitals across the US. Methods: Data were collected from January 2016 to December 2019 on all patients diagnosed with meningitis. CSF PCR studies and CSF antibody tests were then selected for inclusion. Results: The total number of encounters was 3,149 hospitalizations. The total number of individual labs analyzed was 10,613, and of these 262 etiologies were identified. 23.6% (62) of cases were due to enterovirus, 18.7% (49) due to HSV-2, 14.5% (38) due to West Nile virus, 13.7% (36) due to Varicella zoster (VZV), and 10.5% (27) due to Cryptococcus. Additionally, we analyzed the rate of positive test results by region. Nationally, 9.7% of tests ordered for enterovirus were positive. In contrast, 0.5% of tests ordered for HSV 1 were positive. The southeastern United States had the highest rate of positive tests for HSV 2 (7% of tests ordered for HSV 2 were positive). The central United States had the highest rate of positive tests for West Nile virus (11% of tests ordered for West Nile were positive). The northeastern region had the highest rate of positive tests for varicella zoster (18%).
Table 1: Percentage of positive CSF tests (positive tests/tests ordered). Table 2: Number of HIV patients and transplant patients that had positive CSF PCR/serologies. Figure 1: Percentage of positive CSF tests in each region. Conclusion: Approximately 40% of the aseptic meningitis population had treatable etiologies. A third of the Cryptococcus meningitis population had HIV. Furthermore, enteroviruses accounted for the largest share of cases within the US, which is similar to studies done in other parts of the world. Disclosures: All Authors: No reported disclosures.

Development and validation of a difficult laryngoscopy prediction model using machine learning of neck circumference and thyromental height

BMC Anesthesiology ◽

10.1186/s12871-021-01343-4 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Jong Ho Kim ◽

Haewon Kim ◽

Ji Su Jang ◽

Sung Mi Hwang ◽

So Young Lim ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Confidence Interval ◽

Neck Circumference ◽

Difficult Laryngoscopy ◽

Gradient Boosting ◽

Test Set ◽

Equal Distribution ◽

Light Gradient ◽

Extreme Gradient Boosting

Abstract. Background: Predicting a difficult airway is challenging in patients with limited airway evaluation. The aim of this study is to develop and validate a model that predicts difficult laryngoscopy by machine learning of neck circumference and thyromental height as predictors that can be used even for patients with limited airway evaluation. Methods: Variables for prediction of difficult laryngoscopy included age, sex, height, weight, body mass index, neck circumference, and thyromental distance. Difficult laryngoscopy was defined as Grade 3 and 4 by the Cormack-Lehane classification. The preanesthesia and anesthesia data of 1677 patients who had undergone general anesthesia at a single center were collected. The data set was randomly stratified into a training set (80%) and a test set (20%), with equal distribution of difficult laryngoscopy. The training data sets were trained with five algorithms (logistic regression, multilayer perceptron, random forest, extreme gradient boosting, and light gradient boosting machine). The prediction models were validated through a test set. Results: The model’s performance using random forest was best (area under receiver operating characteristic curve = 0.79 [95% confidence interval: 0.72–0.86], area under precision-recall curve = 0.32 [95% confidence interval: 0.27–0.37]). Conclusions: Machine learning can predict difficult laryngoscopy through a combination of several predictors including neck circumference and thyromental height. The performance of the model can be improved with more data, a new variable and a combination of models.
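
The stratified 80/20 split described in the Methods can be sketched as follows. This is a minimal standard-library illustration, not the study's code: the function name `stratified_split`, the seed, and the toy labels (10 positives out of 100) are assumptions. The point is that each class is shuffled and split separately, so the outcome rate is the same in the training and test sets.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices into train/test so each class keeps its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (made up): 1 = difficult laryngoscopy, 0 = not difficult.
toy_labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

With a rare outcome, a plain random split can leave very few positive cases in the test set, which makes metrics such as the area under the precision-recall curve unstable; stratifying avoids that.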

Meteorological Conditions Associated with Increased Incidence of West Nile Virus Disease in the United States, 2004–2012

American Journal of Tropical Medicine and Hygiene ◽

10.4269/ajtmh.14-0737 ◽

2015 ◽

Vol 92 (5) ◽

pp. 1013-1022 ◽

Cited By ~ 38

Author(s):

Micah B. Hahn ◽

Roger S. Nasci ◽

Mark J. Delorey ◽

Rebecca J. Eisen ◽

Andrew J. Monaghan ◽

...

Keyword(s):

United States ◽

West Nile Virus ◽

Virus Disease ◽

The United States ◽

Meteorological Conditions ◽

West Nile

Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams

Entropy ◽

10.3390/e23070859 ◽

2021 ◽

Vol 23 (7) ◽

pp. 859

Author(s):

Abdulaziz O. AlQabbany ◽

Aqil M. Azmi

Keyword(s):

Big Data ◽

Random Forest ◽

Real Time ◽

Data Streams ◽

Learning Algorithm ◽

Concept Drift ◽

The United States ◽

Careful Consideration ◽

Data Sets ◽

Stream Data

We are living in the age of big data, a majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift, a change in the data’s underlying distribution, is a significant issue, especially when learning from data streams. It requires learners to be adaptive to dynamic changes. Random forest is an ensemble approach that is widely used in classical non-streaming settings of machine learning applications. At the same time, the Adaptive Random Forest (ARF) is a stream learning algorithm that showed promising results in terms of its accuracy and ability to deal with various types of drift. The incoming instances’ continuity allows for their binomial distribution to be approximated to a Poisson(1) distribution. In this study, we propose a mechanism to increase such streaming algorithms’ efficiency by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects in online learning: accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value for ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed method of enhancement exhibited considerable improvement in most of the situations.
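
The Poisson-based resampling referred to in this abstract can be sketched with the standard library alone. In online bagging, each base learner trains on an incoming instance k times, with k drawn from Poisson(λ), which emulates bootstrap sampling without storing the stream. The code below is an illustrative sketch under that assumption, not the ARF implementation; `poisson_sample` and `stream_weights` are hypothetical helper names, and the definition of ρ is not reproduced because the abstract does not give it.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw k ~ Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def stream_weights(n_instances, lam=1.0, seed=0):
    """Training weight each incoming instance would get for one base learner."""
    rng = random.Random(seed)
    return [poisson_sample(lam, rng) for _ in range(n_instances)]

weights = stream_weights(10_000, lam=1.0)
mean_weight = sum(weights) / len(weights)
print(mean_weight)
```

A larger λ makes each learner see every instance more often on average, which costs execution time in exchange for more training updates; that trade-off is what a measure combining accuracy and execution time would need to capture.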
