Analysed potential of big data and supervised machine learning techniques in effectively forecasting travel times from fused data

2015, Vol 27 (6), pp. 515-528
Author(s): Ivana Šemanjski

Travel time forecasting is an interesting topic for many ITS services. The increased availability of data-collection sensors expands the pool of predictor variables, but also raises the processing challenges associated with big data. In this paper we analyse the potential of big data and supervised machine learning techniques to forecast travel times effectively. For this purpose we fused data from three sources (Global Positioning System vehicle tracks, road network infrastructure data and meteorological data) and applied four machine learning techniques (k-nearest neighbours, support vector machines, boosting trees and random forest). To evaluate the forecasts we compared them between road classes in terms of absolute error, measured in minutes, and the mean squared percentage error. For road classes with high average speeds and long road segments, the machine learning techniques forecasted travel times with small relative error, while road classes with low average speeds and short segments proved more demanding. All three data sources were shown to have a high impact on travel time forecast accuracy, and the best results across all road classes were achieved with the k-nearest neighbours and random forest techniques.
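As a rough illustration of this kind of comparison, the sketch below trains the four regressors named in the abstract on a fused feature table and scores them with the mean squared percentage error. The file name and feature columns are hypothetical placeholders, not taken from the paper.

```python
# Minimal sketch (assumed setup): compare four regressors on fused GPS/road/weather
# features, scored with the mean squared percentage error used in the paper.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

def mspe(y_true, y_pred):
    """Mean squared percentage error, in percent."""
    return np.mean(((y_true - y_pred) / y_true) ** 2) * 100

# Hypothetical fused dataset: one row per trip on a road segment.
df = pd.read_csv("fused_trips.csv")  # placeholder file name
features = ["segment_length_m", "avg_speed_kmh", "hour_of_day", "rain_mm", "temp_c"]
X, y = df[features], df["travel_time_min"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "kNN": KNeighborsRegressor(n_neighbors=5),
    "SVM": SVR(kernel="rbf", C=10.0),
    "Boosting trees": GradientBoostingRegressor(random_state=0),
    "Random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: MSPE = {mspe(y_test.to_numpy(), model.predict(X_test)):.1f}%")
```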

2018, Vol 7 (3.34), pp. 291
Author(s): M Malleswari, R.J Manira, Praveen Kumar, Murugan

Big data analytics has become the focus of large-scale data processing, and machine learning combined with big data holds great promise for prediction. Churn prediction is one such big data application: its main benefit is preventing customer attrition, especially in telecom. Churn prediction is a day-to-day affair involving millions of customers, so a solution that prevents attrition can save a great deal. This paper compares three machine learning techniques, the decision tree, random forest and gradient boosted tree algorithms, using Apache Spark. Apache Spark is a big data processing engine that provides in-memory processing for higher processing speed. The analysis is made by extracting the features of the data set and training the models. Scala, a language that combines object-oriented and functional programming, is used for implementation: the analysis runs on Apache Spark and modelling is done with Scala ML. The accuracy of the decision tree model came out at 86%, the random forest model at 87% and the gradient boosted tree at 85%.
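The paper implements the comparison in Scala with Spark ML; the sketch below shows the equivalent workflow through the PySpark API so all code samples in this document stay in one language. Column names and the file path are assumptions for illustration.

```python
# Minimal PySpark sketch (the paper uses Scala ML; this shows the equivalent
# Python API). Column names and the file path are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import (DecisionTreeClassifier,
                                       RandomForestClassifier, GBTClassifier)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("churn-comparison").getOrCreate()
df = spark.read.csv("telecom_churn.csv", header=True, inferSchema=True)

# Assemble numeric columns into a single feature vector, as Spark ML expects.
feature_cols = ["tenure_months", "monthly_charges", "total_calls", "complaints"]
df = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)
train, test = df.randomSplit([0.8, 0.2], seed=42)

evaluator = MulticlassClassificationEvaluator(labelCol="churn",
                                              predictionCol="prediction",
                                              metricName="accuracy")
classifiers = {
    "Decision tree": DecisionTreeClassifier(labelCol="churn", featuresCol="features"),
    "Random forest": RandomForestClassifier(labelCol="churn", featuresCol="features"),
    "Gradient boosted tree": GBTClassifier(labelCol="churn", featuresCol="features"),
}
for name, clf in classifiers.items():
    predictions = clf.fit(train).transform(test)
    print(name, "accuracy:", evaluator.evaluate(predictions))
```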


Symmetry, 2021, Vol 13 (3), pp. 403
Author(s): Muhammad Waleed, Tai-Won Um, Tariq Kamal, Syed Muhammad Usman

In this paper, we apply multi-class supervised machine learning techniques to classify agricultural farm machinery. Classifying farm machinery is important for automatic authentication of field activity in a remote setup; in the absence of a sound machine recognition system, there is every possibility of fraudulent activity. To address this need, we classify the machinery using five machine learning techniques: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF) and Gradient Boosting (GB). The models are trained on the vibration and tilt of the machinery, recorded with accelerometer and gyroscope sensors, respectively. The machinery included a leveler, a rotavator and a cultivator. Preliminary analysis of the collected data revealed that the farm machinery, when in operation, showed large variations in vibration and tilt but similar means. Vibration-based and tilt-based classification each achieve good accuracy on their own, with vibration slightly ahead of tilt, and accuracy improves further when the two are used together. All five machine learning algorithms reach an accuracy of more than 82%, with random forest performing best. Gradient boosting and random forest show slight over-fitting (about 9%), but both produce high testing accuracy. In terms of execution time, the decision tree takes the least time to train, while gradient boosting takes the most.
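A minimal sketch of this comparison is shown below, assuming the accelerometer and gyroscope readings have already been summarised into per-window statistics; the column names and file are placeholders, not the authors' actual feature set.

```python
# Minimal sketch (assumed feature layout): classify machinery type from combined
# accelerometer (vibration) and gyroscope (tilt) statistics with five classifiers.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("machinery_sensor_windows.csv")  # hypothetical per-window features
vibration = ["acc_mean", "acc_std", "acc_peak"]   # placeholder column names
tilt = ["gyro_mean", "gyro_std", "gyro_peak"]
X, y = df[vibration + tilt], df["machine_type"]   # leveler / rotavator / cultivator

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, clf.predict(X_tr))
    test_acc = accuracy_score(y_te, clf.predict(X_te))
    # The gap between the two accuracies indicates over-fitting.
    print(f"{name}: train {train_acc:.2f}, test {test_acc:.2f}")
```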


Most online applications, such as Amazon, Snapdeal, Flipkart and many others, attract customers by presenting user reviews about their services. These services typically include hotels, flights, cabs, holiday plans and more. The main objective of this paper is to automatically classify the feedback given by customers into positive, negative and neutral categories, and to produce a summarized review when the feedback contains multiple sentences. The proposed work considers data from several sources, namely Flipkart and Snapdeal. The method comprises collecting the data from the mobile/web application sources, filtering the unwanted data, preprocessing, and finally analyzing and summarizing the reviews using supervised machine learning techniques.
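The abstract does not name a specific classifier, so the sketch below shows one common supervised setup for the classification step: TF-IDF features feeding a linear model that labels reviews positive, negative or neutral. The data file and columns are assumed for illustration.

```python
# Minimal sketch (assumed labelled data): a TF-IDF + linear classifier pipeline
# for sorting reviews into positive / negative / neutral categories.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("reviews.csv")  # hypothetical: columns "text" and "label"
X_tr, X_te, y_tr, y_te = train_test_split(df["text"], df["label"],
                                          stratify=df["label"], random_state=0)

# TF-IDF turns raw review text into weighted term counts; stop-word removal is a
# simple stand-in for the "filtering the unwanted data" step described above.
clf = make_pipeline(TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))

print(clf.predict(["Great hotel, quick check-in", "Flight was delayed twice"]))
```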


2019, Vol 4 (1), pp. 43
Author(s): Nfn Nofriani

Poverty has been a major problem for most countries around the world, including Indonesia. One approach to eradicating poverty is the equitable distribution of social assistance to target households based on the Integrated Database of social assistance. This study compared several well-known supervised machine learning techniques, namely the Naïve Bayes classifier, support vector machines, K-nearest neighbor classification, the C4.5 algorithm, and the random forest algorithm, for predicting household welfare status classification, using the Integrated Database as a case study. The main objective of this study was to choose the best supervised machine learning approach for predicting the classification of a household's welfare status from the attributes in the Integrated Database. The results showed that the random forest algorithm performed best.
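A minimal sketch of such a comparison on tabular household attributes follows; scikit-learn has no C4.5 implementation, so an entropy-based decision tree is used as a stand-in, and the data file and label column are assumptions.

```python
# Minimal sketch (assumed tabular features): compare the five classifiers on
# household attributes with cross-validated accuracy. The entropy-based decision
# tree below approximates C4.5, which scikit-learn does not ship.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("integrated_database_sample.csv")  # hypothetical extract
X = df.drop(columns=["welfare_status"])              # household attributes
y = df["welfare_status"]                             # welfare status label

models = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "C4.5 (approx.)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```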


2021
Author(s): Luke Winston, Michael McCann, George Onofrei

BACKGROUND: The COVID-19 pandemic represents the most unprecedented global challenge in recent times. As the global community attempts to manage the pandemic long term, it is pivotal to understand what factors drive prevalence rates and to predict the future trajectory of the virus.
OBJECTIVE: The aim of this study was to investigate whether socioeconomic indicators help predict year-on-year COVID-19 prevalence rates in a cross-sectional sample of 182 countries. Using a number of supervised machine learning techniques, results were evaluated and compared to determine the most accurate predictive algorithm.
METHODS: This research applied three supervised regression techniques: linear regression, random forest, and AdaBoost. Results were evaluated using k-fold cross-validation and subsequently compared to analyse algorithmic suitability. The analysis involved two models. First, the algorithms were trained to predict 2021 COVID-19 prevalence using only 2020 infection data. Then, socioeconomic indicators were added as features and the algorithms were trained again. The Human Development Index metrics of life expectancy, mean years of schooling, expected years of schooling, and Gross National Income were used to approximate socioeconomic status.
RESULTS: Using 2020 infection prevalence rates as a lone predictor of 2021 prevalence rates, the average predictive accuracy of the algorithms was low (R2=0.562). When the socioeconomic indicators were added as features alongside the 2020 prevalence rates, average predictive performance improved considerably (R2=0.724) and all error statistics decreased, suggesting that adding socioeconomic indicators alongside 2020 infection data considerably improves the prediction of COVID-19 prevalence. Linear regression was the strongest learner, with R2=0.713 on the first model and R2=0.762 on the second, followed by random forest (0.533 and 0.733) and AdaBoost (0.441 and 0.676).
CONCLUSIONS: Understanding the impact of socioeconomic status at the national level will assist with future pandemic management. This paper puts forward new considerations about the application of machine learning techniques to understand and combat the COVID-19 pandemic.
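The two-model design described in the methods can be sketched as below: the same three regressors are cross-validated on a baseline feature set (2020 prevalence only) and on an extended set with the socioeconomic indicators. The country-level table and its column names are assumptions for illustration.

```python
# Minimal sketch (assumed country-level table): compare the two feature sets
# with 5-fold cross-validated R^2 for the three regressors named in the abstract.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor

df = pd.read_csv("covid_country_data.csv")  # hypothetical: one row per country
target = df["prevalence_2021"]
baseline = df[["prevalence_2020"]]
with_ses = df[["prevalence_2020", "life_expectancy", "mean_schooling",
               "expected_schooling", "gni_per_capita"]]

models = {
    "Linear regression": LinearRegression(),
    "Random forest": RandomForestRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
}
for label, X in [("2020 prevalence only", baseline),
                 ("with socioeconomic indicators", with_ses)]:
    for name, model in models.items():
        r2 = cross_val_score(model, X, target, cv=5, scoring="r2").mean()
        print(f"{label} | {name}: mean R^2 = {r2:.3f}")
```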


2020, Vol 28 (2), pp. 253-265
Author(s): Gabriela Bitencourt-Ferreira, Amauri Duarte da Silva, Walter Filgueira de Azevedo

Background: The elucidation of the structure of cyclin-dependent kinase 2 (CDK2) made it possible to develop targeted scoring functions for virtual screening aimed at identifying new inhibitors of this enzyme. CDK2 is a protein target for the development of drugs intended to modulate cell-cycle progression and control; such drugs have potential anticancer activity.
Objective: Our goal here is to review recent applications of machine learning methods to predict ligand-binding affinity for protein targets. To assess the predictive performance of classical scoring functions and targeted scoring functions, we focused our analysis on CDK2 structures.
Methods: We have experimental structural data for hundreds of binary complexes of CDK2 with different ligands, many of them with inhibition constant information. We investigate computational methods to calculate the binding affinity of CDK2 ligands through classical scoring functions and machine-learning models.
Results: Analysis of the predictive performance of classical scoring functions available in docking programs such as Molegro Virtual Docker, AutoDock4, and AutoDock Vina indicated that these methods failed to predict binding affinity with significant correlation with the experimental data. Targeted scoring functions developed through supervised machine learning techniques showed a significant correlation with the experimental data.
Conclusion: Here, we described the application of supervised machine learning techniques to generate a scoring function to predict binding affinity. The machine learning models showed superior predictive performance compared with classical scoring functions, and analysis of the computational models could capture essential structural features responsible for binding affinity against CDK2.
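The general idea of a targeted scoring function, regressing interaction terms from docked complexes against measured affinity and judging the result by its correlation with experiment, can be sketched as follows. The descriptor columns, data file and choice of regressor are hypothetical and not the review's specific protocol.

```python
# Minimal sketch (hypothetical descriptors): a "targeted" scoring function built by
# regressing interaction terms from docked CDK2-ligand complexes against measured
# affinity (pKi); correlation with experiment is the figure of merit.
import pandas as pd
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("cdk2_complexes.csv")  # placeholder: one row per complex
terms = ["hbond_energy", "vdw_energy", "electrostatic_energy",
         "ligand_heavy_atoms", "rotatable_bonds"]  # assumed descriptor columns
X, y = df[terms], df["pKi"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

r, p = pearsonr(y_te, model.predict(X_te))
print(f"Pearson r between predicted and experimental affinity: {r:.2f} (p={p:.3g})")
```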


Author(s): Augusto Cerqua, Roberta Di Stefano, Marco Letta, Sara Miccoli

Estimates of the real death toll of the COVID-19 pandemic have proven to be problematic in many countries, Italy being no exception. Mortality estimates at the local level are even more uncertain as they require stringent conditions, such as granularity and accuracy of the data at hand, which are rarely met. The “official” approach adopted by public institutions to estimate the “excess mortality” during the pandemic draws on a comparison between observed all-cause mortality data for 2020 and averages of mortality figures in the past years for the same period. In this paper, we apply the recently developed machine learning control method to build a more realistic counterfactual scenario of mortality in the absence of COVID-19. We demonstrate that supervised machine learning techniques outperform the official method by substantially improving the prediction accuracy of the local mortality in “ordinary” years, especially in small- and medium-sized municipalities. We then apply the best-performing algorithms to derive estimates of local excess mortality for the period between February and September 2020. Such estimates allow us to provide insights about the demographic evolution of the first wave of the pandemic throughout the country. To help improve diagnostic and monitoring efforts, our dataset is freely available to the research community.
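The underlying idea, training on pre-pandemic years to predict a counterfactual 2020 and taking observed minus predicted deaths as excess mortality, can be illustrated with the rough sketch below. The municipal panel, feature columns and regressor are assumptions, not the paper's exact specification.

```python
# Minimal sketch (assumed municipal panel): learn mortality in "ordinary" years,
# predict a counterfactual 2020, and take observed minus predicted as excess deaths.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("municipal_mortality.csv")  # hypothetical: one row per municipality-year
features = ["population", "share_over_65", "deaths_lag1", "deaths_lag2", "deaths_lag3"]

train = df[df["year"] < 2020]                 # pre-pandemic years only
target_year = df[df["year"] == 2020].copy()

model = GradientBoostingRegressor(random_state=0)
model.fit(train[features], train["deaths_feb_sep"])

# Counterfactual: deaths each municipality would have expected without COVID-19.
target_year["expected_deaths"] = model.predict(target_year[features])
target_year["excess_deaths"] = target_year["deaths_feb_sep"] - target_year["expected_deaths"]
print(target_year[["municipality", "excess_deaths"]].head())
```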

