Supervised machine learning based prediction of 2021 cross-national COVID-19 prevalence rates using historical infection data and socioeconomic indicators. (Preprint)

2021 ◽  
Author(s):  
Luke Winston ◽  
Michael McCann ◽  
George Onofrei

BACKGROUND The COVID-19 pandemic represents an unprecedented global challenge. As the global community attempts to manage the pandemic in the long term, it is pivotal to understand what factors drive prevalence rates and to predict the future trajectory of the virus. OBJECTIVE The aim of this study was to investigate whether socioeconomic indicators help predict year-on-year COVID-19 prevalence rates in a cross-sectional sample of 182 countries. Several supervised machine learning techniques were evaluated and compared to determine the most accurate predictive algorithm. METHODS This research applied three supervised regression techniques: linear regression, random forest, and AdaBoost. Results were evaluated using k-fold cross-validation and subsequently compared to analyse algorithmic suitability. The analysis involved two models. First, the algorithms were trained to predict 2021 COVID-19 prevalence using only 2020 infection data. Then socioeconomic indicators were added as features and the algorithms were trained again. The Human Development Index metrics of life expectancy, mean years of schooling, expected years of schooling, and Gross National Income were used to approximate socioeconomic status. RESULTS Using 2020 infection prevalence rates as the sole predictor of 2021 prevalence rates, the average predictive accuracy of the algorithms was low (R²=0.562). When the socioeconomic indicators were added alongside 2020 prevalence rates as features, average predictive performance improved considerably (R²=0.724) and all error statistics decreased. This suggests that adding socioeconomic indicators alongside 2020 infection data considerably improved prediction of COVID-19 prevalence. Linear regression was the strongest learner, with R²=0.713 on the first model and R²=0.762 on the second, followed by random forest (0.533 and 0.733) and AdaBoost (0.441 and 0.676).
CONCLUSIONS Understanding the impact of socioeconomic status at national level will assist with future pandemic management. This paper puts forward new considerations about the application of machine learning techniques to understand and combat the COVID-19 pandemic.
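The evaluation described in this abstract (k-fold cross-validation scored by R²) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses a one-predictor least-squares fit in plain Python, the function names (`fit_line`, `r_squared`, `kfold_r2`) are hypothetical, the fold assignment is a simple interleaved split, and the data in the usage lines are synthetic. The final lines only check that the average R² values reported in the abstract agree with the per-algorithm values it quotes.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def kfold_r2(xs, ys, k=5):
    """Mean out-of-fold R^2 over k interleaved folds (assumed split scheme)."""
    n = len(xs)
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th sample is held out
        train = [(xs[i], ys[i]) for i in range(n) if i not in test_idx]
        test = [(xs[i], ys[i]) for i in range(n) if i in test_idx]
        slope, intercept = fit_line(*zip(*train))
        preds = [slope * x + intercept for x, _ in test]
        scores.append(r_squared([y for _, y in test], preds))
    return mean(scores)


# Synthetic usage: a perfectly linear relationship (not the study's data).
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
score = kfold_r2(xs, ys, k=5)

# Consistency check on the figures quoted in the abstract:
# linear regression, random forest, AdaBoost for each model.
avg_1 = round(mean([0.713, 0.533, 0.441]), 3)  # 0.562, as reported
avg_2 = round(mean([0.762, 0.733, 0.676]), 3)  # 0.724, as reported
```

Comparing the two models in the abstract would amount to running the same cross-validation loop twice, once with 2020 prevalence alone and once with the socioeconomic indicators added as further features, which requires a multi-feature regressor in place of `fit_line`.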

Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 403
Author(s):  
Muhammad Waleed ◽  
Tai-Won Um ◽  
Tariq Kamal ◽  
Syed Muhammad Usman

In this paper, we apply multi-class supervised machine learning techniques to classify agricultural farm machinery. Classifying farm machinery is important when automatically authenticating field activity in a remote setup. In the absence of a sound machine recognition system, fraudulent activity can easily take place. To address this need, we classify the machinery using five machine learning techniques: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF) and Gradient Boosting (GB). To train the models, we use the vibration and tilt of the machinery, recorded using accelerometer and gyroscope sensors, respectively. The machinery included the leveler, rotavator and cultivator. Preliminary analysis of the collected data revealed that the farm machinery (when in operation) showed large variations in vibration and tilt but similar means. Additionally, vibration-based and tilt-based classifications of farm machinery each show good accuracy when used alone (with vibration performing slightly better than tilt). However, accuracy improves further when tilt and vibration are used together. Furthermore, all five machine learning algorithms achieve an accuracy of more than 82%, with random forest performing best. Gradient boosting and random forest show slight over-fitting (about 9%), but both algorithms produce high testing accuracy. In terms of execution time, the decision tree takes the least time to train, while gradient boosting takes the most.


2015 ◽  
Vol 27 (6) ◽  
pp. 515-528 ◽  
Author(s):  
Ivana Šemanjski

Travel time forecasting is an interesting topic for many ITS services. The increased availability of data collection sensors increases the availability of predictor variables but also highlights the heavy processing demands of such big data. In this paper we aimed to analyse the potential of big data and supervised machine learning techniques in effectively forecasting travel times. For this purpose we used fused data from three data sources (Global Positioning System vehicle tracks, road network infrastructure data and meteorological data) and four machine learning techniques (k-nearest neighbours, support vector machines, boosting trees and random forest). To evaluate the forecasting results we compared them across different road classes in terms of absolute values, measured in minutes, and the mean squared percentage error. For road classes with high average speeds and long road segments, machine learning techniques forecasted travel times with small relative error, while for road classes with low average speeds and short segment lengths this was a more demanding task. All three data sources proved to have a high impact on travel time forecast accuracy, and the best results (taking into account all road classes) were achieved with the k-nearest neighbours and random forest techniques.
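As a rough illustration of why relative error behaves differently on long and short segments, the sketch below computes a mean squared percentage error. It assumes the common definition (the mean of squared relative errors, expressed as a fraction rather than multiplied by 100); the abstract does not state the exact formula used, and the function name `mspe` and the travel times are hypothetical.

```python
def mspe(actual, forecast):
    """Mean squared percentage error, as a fraction (assumed definition)."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(((a - f) / a) ** 2 for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical travel times in minutes: the same one-minute absolute error
# weighs far more on a short segment than on a long one.
long_segment = mspe([10.0], [11.0])   # relative error of 10%
short_segment = mspe([2.0], [3.0])    # relative error of 50%
```

Under this definition a fixed absolute error in minutes translates into a much larger squared percentage error when the true travel time is short, which is consistent with the abstract's observation that short, slow road classes were the more demanding case.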


2019 ◽  
Vol 4 (1) ◽  
pp. 43
Author(s):  
Nfn Nofriani

Poverty has been a major problem for most countries around the world, including Indonesia. One approach to eradicating poverty is the equitable distribution of social assistance to target households based on the Integrated Database of social assistance. This study compared several well-known supervised machine learning techniques, namely the Naïve Bayes Classifier, Support Vector Machines, K-Nearest Neighbor Classification, the C4.5 Algorithm, and the Random Forest Algorithm, to predict household welfare status classification, using the Integrated Database as a case study. The main objective of this study was to choose the best supervised machine learning approach for predicting the classification of a household’s welfare status based on attributes in the Integrated Database. The results showed that the Random Forest Algorithm was the best.


2020 ◽  
Vol 28 (2) ◽  
pp. 253-265 ◽  
Author(s):  
Gabriela Bitencourt-Ferreira ◽  
Amauri Duarte da Silva ◽  
Walter Filgueira de Azevedo

Background: The elucidation of the structure of cyclin-dependent kinase 2 (CDK2) made it possible to develop targeted scoring functions for virtual screening aimed at identifying new inhibitors for this enzyme. CDK2 is a protein target for the development of drugs intended to modulate cell-cycle progression and control. Such drugs have potential anticancer activities. Objective: Our goal here is to review recent applications of machine learning methods to predict ligand-binding affinity for protein targets. To assess the predictive performance of classical scoring functions and targeted scoring functions, we focused our analysis on CDK2 structures. Methods: We have experimental structural data for hundreds of binary complexes of CDK2 with different ligands, many of them with inhibition constant information. We investigate here computational methods to calculate the binding affinity of CDK2 through classical scoring functions and machine-learning models. Results: Analysis of the predictive performance of classical scoring functions available in docking programs such as Molegro Virtual Docker, AutoDock4, and AutoDock Vina indicated that these methods failed to predict binding affinity with significant correlation with experimental data. Targeted scoring functions developed through supervised machine learning techniques showed a significant correlation with experimental data. Conclusion: Here, we described the application of supervised machine learning techniques to generate a scoring function to predict binding affinity. Machine learning models showed superior predictive performance when compared with classical scoring functions. Analysis of the computational models obtained through machine learning could capture essential structural features responsible for binding affinity against CDK2.


Author(s):  
Augusto Cerqua ◽  
Roberta Di Stefano ◽  
Marco Letta ◽  
Sara Miccoli

Estimates of the real death toll of the COVID-19 pandemic have proven to be problematic in many countries, Italy being no exception. Mortality estimates at the local level are even more uncertain as they require stringent conditions, such as granularity and accuracy of the data at hand, which are rarely met. The “official” approach adopted by public institutions to estimate the “excess mortality” during the pandemic draws on a comparison between observed all-cause mortality data for 2020 and averages of mortality figures in the past years for the same period. In this paper, we apply the recently developed machine learning control method to build a more realistic counterfactual scenario of mortality in the absence of COVID-19. We demonstrate that supervised machine learning techniques outperform the official method by substantially improving the prediction accuracy of the local mortality in “ordinary” years, especially in small- and medium-sized municipalities. We then apply the best-performing algorithms to derive estimates of local excess mortality for the period between February and September 2020. Such estimates allow us to provide insights about the demographic evolution of the first wave of the pandemic throughout the country. To help improve diagnostic and monitoring efforts, our dataset is freely available to the research community.
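The “official” baseline described in this abstract (observed deaths minus the average of the same period in past years) can be sketched as below. This is only an illustration of that simple comparison, not the machine learning control method the authors propose; the function name `excess_mortality` and the death counts are hypothetical.

```python
from statistics import mean


def excess_mortality(observed_deaths, past_years_deaths):
    """Observed all-cause deaths minus the historical average for the same period."""
    baseline = mean(past_years_deaths)  # counterfactual under the "official" approach
    return observed_deaths - baseline


# Hypothetical municipality: deaths in the same months of five earlier years,
# compared with the count observed in 2020.
past = [100, 110, 90, 105, 95]
excess = excess_mortality(130, past)
```

In a small municipality a handful of deaths can shift the historical average noticeably, which is one reason the abstract argues for a learned counterfactual at the local level instead of a plain multi-year mean.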


Author(s):  
Linwei Hu ◽  
Jie Chen ◽  
Joel Vaughan ◽  
Soroush Aramideh ◽  
Hanyu Yang ◽  
...  
