Forecasting North America Winter Surface Air Temperature Using Machine Learning Methods

Author(s):  
Qifeng Qian ◽  
Xiaojing Jia ◽  
Hai Lin

<p>Two machine learning (ML) models (Support Vector Regression and Extreme Gradient Boosting; SVR and XGBoost hereafter) have been developed to perform seasonal forecast for the winter (December–January–February, DJF) surface air temperature (SAT) in North America (NA) in this study. The seasonal forecast skills of the two ML models are evaluated in a cross-validated fashion. Forecast results from one Linear Regression (LR and hereafter) model and two Canadian dynamic climate models are used for the purpose of a comparison. In the take-one-out hindcast experiment, the two ML models and the LR model show reasonable seasonal forecast skills for the winter SAT in NA. Comparing to the two Canadian dynamic models, the two ML models and the LR model have better forecast skill for the winter SAT over the central NA which mainly get contribution of a skillful forecast of the second Empirical Orthogonal Function (EOF) mode of winter SAT over NA. In general, the SVR model and XGBoost model hindcasts show better forecast performances than LR model. However, the LR model shows less dependence on the size of the training dataset than SVR and XGBoost models. In the real forecast experiments during the period 2011-2017, compared to the two Canadian dynamic climate models, the two ML models clearly improve the forecast skill of winter SAT over northern and central NA. The results of this study suggest that ML models may provide real-time supplementary forecast tools to improve the forecast skill and may operationally facilitate the seasonal forecast of the winter climate of NA. </p>

2020 ◽  
Vol 2020 ◽  
pp. 1-14
Author(s):  
Ke Zhou ◽  
Hailei Liu ◽  
Xiaobo Deng ◽  
Hao Wang ◽  
Shenglan Zhang

Six machine-learning approaches, including multivariate linear regression (MLR), gradient boosting decision tree, k-nearest neighbors, random forest, extreme gradient boosting (XGB), and deep neural network (DNN), were compared for near-surface air-temperature (Tair) estimation from the new generation of Chinese geostationary meteorological satellite Fengyun-4A (FY-4A) observations. The brightness temperatures in split-window channels from the Advanced Geostationary Radiation Imager (AGRI) of FY-4A and numerical weather prediction data from the global forecast system were used as the predictor variables for Tair estimation. The performance of each model and the temporal and spatial distribution of the estimated Tair errors were analyzed. The results showed that the XGB model had better overall performance, with R2 of 0.902, bias of −0.087°C, and root-mean-square error of 1.946°C. The spatial variation characteristics of the Tair error of the XGB method were less obvious than those of the other methods. The XGB model can provide more stable and high-precision Tair for a large-scale Tair estimation over China and can serve as a reference for Tair estimation based on machine-learning models.


2021 ◽  
pp. 1-42
Author(s):  
QiFeng Qian ◽  
XiaoJing Jia ◽  
Hai Lin ◽  
Ruizhi Zhang

AbstractIn this study, four machine learning (ML) models (gradient boost decision tree (GBDT), light gradient boosting machine (LightGBM), categorical boosting (CatBoost) and extreme gradient boosting (XGBoost)) are used to perform seasonal forecasts for non-monsoonal winter precipitation over the Eurasian continent (30-60°N, 30-105°E) (NWPE). The seasonal forecast results from a traditional linear regression (LR) model and two dynamic models are compared. The ML and LR models are trained using the data for the period of 1979-2010, and then, these empirical models are used to perform the seasonal forecast of NWPE for 2011-2018. Our results show that the four ML models have reasonable seasonal forecast skills for the NWPE and clearly outperform the LR model. The ML models and the dynamic models have skillful forecasts for the NWPE over different regions. The ensemble means of the forecasts including the ML models and dynamic models show higher forecast skill for the NWEP than the ensemble mean of the dynamic-only models. The forecast skill of the ML models mainly benefits from a skillful forecast of the third empirical orthogonal function (EOF) mode (EOF3) of the NWPE, which has a good and consistent prediction among the ML models. Our results also illustrate that the sea ice over the Arctic in the previous autumn is the most important predictor in the ML models in forecasting the NWPE. This study suggests that ML models may be useful tools to help improve seasonal forecasts of the NWPE.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Moojung Kim ◽  
Young Jae Kim ◽  
Sung Jin Park ◽  
Kwang Gi Kim ◽  
Pyung Chun Oh ◽  
...  

Abstract Background Annual influenza vaccination is an important public health measure to prevent influenza infections and is strongly recommended for cardiovascular disease (CVD) patients, especially in the current coronavirus disease 2019 (COVID-19) pandemic. The aim of this study is to develop a machine learning model to identify Korean adult CVD patients with low adherence to influenza vaccination Methods Adults with CVD (n = 815) from a nationally representative dataset of the Fifth Korea National Health and Nutrition Examination Survey (KNHANES V) were analyzed. Among these adults, 500 (61.4%) had answered "yes" to whether they had received seasonal influenza vaccinations in the past 12 months. The classification process was performed using the logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB) machine learning techniques. Because the Ministry of Health and Welfare in Korea offers free influenza immunization for the elderly, separate models were developed for the < 65 and ≥ 65 age groups. Results The accuracy of machine learning models using 16 variables as predictors of low influenza vaccination adherence was compared; for the ≥ 65 age group, XGB (84.7%) and RF (84.7%) have the best accuracies, followed by LR (82.7%) and SVM (77.6%). For the < 65 age group, SVM has the best accuracy (68.4%), followed by RF (64.9%), LR (63.2%), and XGB (61.4%). Conclusions The machine leaning models show comparable performance in classifying adult CVD patients with low adherence to influenza vaccination.


2006 ◽  
Vol 19 (13) ◽  
pp. 3279-3293 ◽  
Author(s):  
X. Quan ◽  
M. Hoerling ◽  
J. Whitaker ◽  
G. Bates ◽  
T. Xu

Abstract In this study the authors diagnose the sources for the contiguous U.S. seasonal forecast skill that are related to sea surface temperature (SST) variations using a combination of dynamical and empirical methods. The dynamical methods include ensemble simulations with four atmospheric general circulation models (AGCMs) forced by observed monthly global SSTs from 1950 to 1999, and ensemble AGCM experiments forced by idealized SST anomalies. The empirical methods involve a suite of reductions of the AGCM simulations. These include uni- and multivariate regression models that encapsulate the simultaneous and one-season lag linear connections between seasonal mean tropical SST anomalies and U.S. precipitation and surface air temperature. Nearly all of the AGCM skill in U.S. precipitation and surface air temperature, arising from global SST influences, can be explained by a single degree of freedom in the tropical SST field—that associated with the linear atmospheric signal of El Niño–Southern Oscillation (ENSO). The results support previous findings regarding the preeminence of ENSO as a U.S. skill source. The diagnostic methods used here exposed another skill source that appeared to be of non-ENSO origins. In late autumn, when the AGCM simulation skill of U.S. temperatures peaked in absolute value and in spatial coverage, the majority of that originated from SST variability in the subtropical west Pacific Ocean and the South China Sea. Hindcast experiments were performed for 1950–99 that revealed most of the simulation skill of the U.S. seasonal climate to be recoverable at one-season lag. The skill attributable to the AGCMs was shown to achieve parity with that attributable to empirical models derived purely from observational data. The diagnostics promote the interpretation that only limited advances in U.S. seasonal prediction skill should be expected from methods seeking to capitalize on sea surface predictors alone, and that advances that may occur in future decades could be readily masked by inherent multidecadal fluctuations in skill of coupled ocean–atmosphere systems.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Arturo Moncada-Torres ◽  
Marissa C. van Maaren ◽  
Mathijs P. Hendriks ◽  
Sabine Siesling ◽  
Gijs Geleijnse

AbstractCox Proportional Hazards (CPH) analysis is the standard for survival analysis in oncology. Recently, several machine learning (ML) techniques have been adapted for this task. Although they have shown to yield results at least as good as classical methods, they are often disregarded because of their lack of transparency and little to no explainability, which are key for their adoption in clinical settings. In this paper, we used data from the Netherlands Cancer Registry of 36,658 non-metastatic breast cancer patients to compare the performance of CPH with ML techniques (Random Survival Forests, Survival Support Vector Machines, and Extreme Gradient Boosting [XGB]) in predicting survival using the $$c$$ c -index. We demonstrated that in our dataset, ML-based models can perform at least as good as the classical CPH regression ($$c$$ c -index $$\sim \,0.63$$ ∼ 0.63 ), and in the case of XGB even better ($$c$$ c -index $$\sim 0.73$$ ∼ 0.73 ). Furthermore, we used Shapley Additive Explanation (SHAP) values to explain the models’ predictions. We concluded that the difference in performance can be attributed to XGB’s ability to model nonlinearities and complex interactions. We also investigated the impact of specific features on the models’ predictions as well as their corresponding insights. Lastly, we showed that explainable ML can generate explicit knowledge of how models make their predictions, which is crucial in increasing the trust and adoption of innovative ML techniques in oncology and healthcare overall.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Hengrui Chen ◽  
Hong Chen ◽  
Ruiyu Zhou ◽  
Zhizhen Liu ◽  
Xiaoke Sun

The safety issue has become a critical obstacle that cannot be ignored in the marketization of autonomous vehicles (AVs). The objective of this study is to explore the mechanism of AV-involved crashes and analyze the impact of each feature on crash severity. We use the Apriori algorithm to explore the causal relationship between multiple factors to explore the mechanism of crashes. We use various machine learning models, including support vector machine (SVM), classification and regression tree (CART), and eXtreme Gradient Boosting (XGBoost), to analyze the crash severity. Besides, we apply the Shapley Additive Explanations (SHAP) to interpret the importance of each factor. The results indicate that XGBoost obtains the best result (recall = 75%; G-mean = 67.82%). Both XGBoost and Apriori algorithm effectively provided meaningful insights about AV-involved crash characteristics and their relationship. Among all these features, vehicle damage, weather conditions, accident location, and driving mode are the most critical features. We found that most rear-end crashes are conventional vehicles bumping into the rear of AVs. Drivers should be extremely cautious when driving in fog, snow, and insufficient light. Besides, drivers should be careful when driving near intersections, especially in the autonomous driving mode.


2021 ◽  
pp. 289-301
Author(s):  
B. Martín ◽  
J. González–Arias ◽  
J. A. Vicente–Vírseda

Our aim was to identify an optimal analytical approach for accurately predicting complex spatio–temporal patterns in animal species distribution. We compared the performance of eight modelling techniques (generalized additive models, regression trees, bagged CART, k–nearest neighbors, stochastic gradient boosting, support vector machines, neural network, and random forest –enhanced form of bootstrap. We also performed extreme gradient boosting –an enhanced form of radiant boosting– to predict spatial patterns in abundance of migrating Balearic shearwaters based on data gathered within eBird. Derived from open–source datasets, proxies of frontal systems and ocean productivity domains that have been previously used to characterize the oceanographic habitats of seabirds were quantified, and then used as predictors in the models. The random forest model showed the best performance according to the parameters assessed (RMSE value and R2). The correlation between observed and predicted abundance with this model was also considerably high. This study shows that the combination of machine learning techniques and massive data provided by open data sources is a useful approach for identifying the long–term spatial–temporal distribution of species at regional spatial scales.


2021 ◽  
Author(s):  
Seong Hwan Kim ◽  
Eun-Tae Jeon ◽  
Sungwook Yu ◽  
Kyungmi O ◽  
Chi Kyung Kim ◽  
...  

Abstract We aimed to develop a novel prediction model for early neurological deterioration (END) based on an interpretable machine learning (ML) algorithm for atrial fibrillation (AF)-related stroke and to evaluate the prediction accuracy and feature importance of ML models. Data from multi-center prospective stroke registries in South Korea were collected. After stepwise data preprocessing, we utilized logistic regression, support vector machine, extreme gradient boosting, light gradient boosting machine (LightGBM), and multilayer perceptron models. We used the Shapley additive explanations (SHAP) method to evaluate feature importance. Of the 3,623 stroke patients, the 2,363 who had arrived at the hospital within 24 hours of symptom onset and had available information regarding END were included. Of these, 318 (13.5%) had END. The LightGBM model showed the highest area under the receiver operating characteristic curve (0.778, 95% CI, 0.726 - 0.830). The feature importance analysis revealed that fasting glucose level and the National Institute of Health Stroke Scale score were the most influential factors. Among ML algorithms, the LightGBM model was particularly useful for predicting END, as it revealed new and diverse predictors. Additionally, the SHAP method can be adjusted to individualize the features’ effects on the predictive power of the model.


Author(s):  
Harsha A K

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detect malware like packet content analysis are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features like packet size, arrival time, source and destination addresses and other such metadata to detect malware. Such information can be used to train machine learning classifiers in order to classify malicious and benign packets. In this paper, we offer an efficient malware detection approach using classification algorithms in machine learning such as support vector machine, random forest and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyper parameters of the algorithms, in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.


Sign in / Sign up

Export Citation Format

Share Document