Model Evaluation for Forecasting Traffic Accident Severity in Rainy Seasons Using Machine Learning Algorithms: Seoul City Study

There have been numerous studies on traffic accidents and their severity, particularly in relation to weather conditions and road geometry. In these studies, traditional statistical methods have been employed, such as linear regression, logistic regression, and negative binomial regression modeling, which are the most common linear and non-linear regression analysis methods. In this research, machine learning architecture was applied to this problem using the random forest, artificial neural network, and decision tree techniques to ascertain the strengths and weaknesses of these methods. Three data sets were used: road geometry data, precipitation data, and traffic accident data over nine years corresponding to the Naebu Expressway, which is located in Seoul, Korea. For the model evaluation, three measures were employed: the out-of-bag estimate of error rate (OOB), mean square error (MSE), and root mean square error (RMSE). The low mean OOB, MSE, and RMSE observed in the results obtained using the proposed random forest model demonstrate its accuracy.
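As an illustration of two of the evaluation measures named above, the following is a minimal sketch of MSE and RMSE in plain Python. The function names and the sample values are hypothetical and are not taken from the study; the OOB estimate is specific to bagged ensembles such as random forest and is not covered here.

```python
import math

def mean_squared_error(actual, predicted):
    """Average of the squared differences between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mean_squared_error(actual, predicted))

# Hypothetical severity scores, for illustration only
observed = [1.0, 2.0, 3.0, 2.0]
forecast = [1.0, 2.5, 2.5, 2.0]

mse = mean_squared_error(observed, forecast)
rmse = root_mean_squared_error(observed, forecast)
print(f"MSE={mse:.4f} RMSE={rmse:.4f}")
```

Because RMSE is in the units of the target variable, it is usually the easier of the two to interpret, while MSE penalizes large errors more visibly.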

Predicting the Mechanical Properties of RCA-Based Concrete Using Supervised Machine Learning Algorithms

Materials ◽

10.3390/ma15020647 ◽

2022 ◽

Vol 15 (2) ◽

pp. 647

Author(s):

Meijun Shang ◽

Hejun Li ◽

Ayaz Ahmad ◽

Waqas Ahmad ◽

Krzysztof Adam Ostrowski ◽

...

Keyword(s):

Machine Learning ◽

Mechanical Properties ◽

Mean Square Error ◽

Coarse Aggregate ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Environmental Damage ◽

Fine Aggregate ◽

Mean Square ◽

The Impact

Environment-friendly concrete is gaining popularity because it consumes less energy and causes less damage to the environment. Rapid increases in population and in demand for construction throughout the world are significantly depleting natural resources. Meanwhile, construction waste continues to grow at a high rate as older buildings are demolished. As a result, the use of recycled materials may contribute to improving the quality of life and preventing environmental damage. Additionally, the application of recycled coarse aggregate (RCA) in concrete is essential for minimizing environmental issues. The compressive strength (CS) and splitting tensile strength (STS) of concrete containing RCA are predicted in this article using decision tree (DT) and AdaBoost machine learning (ML) techniques. A total of 344 data points with nine input variables (water, cement, fine aggregate, natural coarse aggregate, RCA, superplasticizers, water absorption of RCA, maximum size of RCA, and density of RCA) were used to run the models. The models were validated using k-fold cross-validation together with the coefficient of determination (R²), mean square error (MSE), mean absolute error (MAE), and root mean square error (RMSE). The models' performance was also assessed using statistical checks. Additionally, sensitivity analysis was used to determine the impact of each variable on the forecasting of mechanical properties.
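The k-fold validation described above can be sketched as follows. This is a minimal illustration under stated assumptions: a one-variable least-squares line stands in for the DT and AdaBoost models used in the article, the folds are contiguous rather than shuffled, and all function names and data are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def cross_validate(xs, ys, k):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        train_y = [ys[i] for i in range(len(ys)) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        preds = [slope * xs[i] + intercept for i in test_idx]
        truth = [ys[i] for i in test_idx]
        scores.append({"r2": r_squared(truth, preds),
                       "mae": mean_absolute_error(truth, preds)})
    return scores

# Hypothetical, perfectly linear data (y = 2x + 1), for illustration only
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
scores = cross_validate(xs, ys, k=5)
```

Averaging the per-fold scores gives a single estimate of out-of-sample performance; in practice the rows would normally be shuffled before splitting.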

Forecasting the Market with Machine Learning Algorithms: An Application of NMC-BERT-LSTM-DQN-X Algorithm in Quantitative Trading

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3488378 ◽

2022 ◽

Vol 16 (4) ◽

pp. 1-22

Author(s):

Chang Liu ◽

Jie Yan ◽

Feiyue Guo ◽

Min Guo

Keyword(s):

Machine Learning ◽

Stock Market ◽

Mean Square Error ◽

Short Term Memory ◽

The State ◽

Machine Learning Algorithms ◽

Future Market ◽

Mean Square ◽

Market Data ◽

Market Trends

Although machine learning (ML) algorithms have been widely used in forecasting the trend of stock market indices, they have failed to consider the following crucial aspects of market forecasting: (1) investors’ emotions and attitudes toward future market trends have a material impact on market trend forecasting; (2) the length of past market data should be dynamically adjusted according to the market status; and (3) the transition of market statuses should be considered when forecasting market trends. In this study, we propose an innovative ML method to forecast China's stock market trends by addressing the three issues above. Specifically, sentimental factors (see Appendix [1] for full trans) were first collected to measure investors’ emotions and attitudes. Then, a non-stationary Markov chain (NMC) model was used to capture dynamic transitions of market statuses. We chose the state-of-the-art (SOTA) method, namely Bidirectional Encoder Representations from Transformers (BERT), to predict the state of the market at time t, and a long short-term memory (LSTM) model was used to estimate the varying length of past market data in market trend prediction, where the input of the LSTM (the state of the market at time t) was the output of BERT, and the probabilities for opening and closing the gates in the LSTM model were based on outputs of the NMC model. Finally, the optimum parameters of the proposed algorithm were calculated using a reinforcement learning-based deep Q-Network. Compared to existing forecasting methods, the proposed algorithm achieves better results, with a forecasting accuracy of 61.77%, an annualized return of 29.25%, and maximum losses of −8.29%. Furthermore, the proposed model achieved the lowest forecasting error: mean square error (0.095), root mean square error (0.0739), mean absolute error (0.104), and mean absolute percent error (15.1%). As a result, the proposed market forecasting model can help investors obtain more accurate market forecast information.

Interpolation of Instantaneous Air Temperature Using Geographical and MODIS Derived Variables with Machine Learning Techniques

10.20944/preprints201906.0008.v1 ◽

2019 ◽

Author(s):

Marcos Ruiz-Álvarez ◽

Francisco Alonso-Sarría ◽

Francisco Gomariz-Castillo

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Linear Regression ◽

Air Temperature ◽

Satellite Data ◽

Multivariate Linear Regression ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Support Vector

Several methods have been tried to estimate air temperature using satellite imagery. In this paper, the results of two machine learning algorithms, Support Vector Machine and Random Forest, are compared with Multivariate Linear Regression, TVX and Ordinary kriging. Several geographic, remote sensing and time variables are used as predictors. The validation is carried out using four different statistics on a daily basis, allowing the use of ANOVA to compare the results. The main conclusion is that Random Forest with residual kriging produces the best results (R² = 0.612 ± 0.019, NSE = 0.578 ± 0.025, RMSE = 1.068 ± 0.027, PBIAS = −0.172 ± 0.046), whereas TVX produces the least accurate results. The environmental conditions in the study area are not really suited to TVX; moreover, this method only takes into account satellite data. On the other hand, regression methods (Support Vector Machine, Random Forest and Multivariate Linear Regression) use several parameters that are easily calculated from a Digital Elevation Model, adding very little difficulty to the use of satellite data alone. The most important variables in the Random Forest model were satellite temperature, potential irradiation and cdayt, a cosine transformation of the Julian day.
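Two of the statistics reported above, NSE and PBIAS, are less common outside hydrology and climatology, so a minimal sketch may help. It assumes that NSE denotes the Nash-Sutcliffe efficiency and PBIAS the percent bias, which the abstract does not spell out. The sign convention for PBIAS varies between authors; the one used here (positive when the model overestimates) is an assumption, as are the function names and sample temperatures.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """1 minus the ratio of residual variance to observed variance.

    1.0 is a perfect fit; 0.0 means the model is no better than
    predicting the observed mean.
    """
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / spread

def percent_bias(observed, simulated):
    """Total signed error as a percentage of the observed total.

    Assumed convention: positive values indicate overestimation.
    """
    total_error = sum(s - o for o, s in zip(observed, simulated))
    return 100.0 * total_error / sum(observed)

# Hypothetical air temperatures in degrees Celsius, for illustration only
observed = [10.0, 12.0, 14.0, 16.0]
simulated = [11.0, 12.0, 14.0, 16.0]

nse = nash_sutcliffe_efficiency(observed, simulated)
pbias = percent_bias(observed, simulated)
```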

PSXV-29 Late-Breaking: Investigating use of machine learning algorithms to predict days in herd for commercial beef cattle

Journal of Animal Science ◽

10.1093/jas/skab235.698 ◽

2021 ◽

Vol 99 (Supplement_3) ◽

pp. 381-381

Author(s):

Ghader Manafiazar ◽

Mohammad Riazi ◽

John A Basarab ◽

Changxi Li ◽

Paul Stothard ◽

...

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Phenotypic Variation ◽

Model Performance ◽

Genomic Analysis ◽

Machine Learning Algorithms ◽

Phenotypic Variance ◽

Mean Square ◽

Simple Linear Regression ◽

Genomic Information

The objective of this study was to explore the potential of machine learning (ML) algorithms to increase the accuracy of predicting individual days in the herd (DIH, an indicator of stayability) using reproductive records and genomic information. A total of 6943 cows from 3 herds with reproductive performance records were included in the study, of which 696 cows had genomic information (genotyped using the Illumina Bovine 50k SNP BeadChip). Different libraries based on R and Python, including Lazy Predict, Scikit-learn, PyCaret, and H2O Flow, were used to test various ML models. Genomic information was subjected to quality control by removing SNPs with an allele frequency less than 0.05 or with a call rate lower than 0.95. A total of 42,689 SNPs remained for further analysis and accounted for 11% of phenotypic variation (heritability of 0.11±0.02) in DIH. Different numbers of SNPs (500, 1K, 5K, 10K, and 15K) were selected based on their contribution to phenotypic variation from GWAS and were included in the models. Model performance measures, such as mean absolute error (MAE) and mean square error (MSE), worsened as more SNPs were included in the model. The Bayesian Ridge algorithm using the 500 top SNPs contributing to the phenotypic variance had the best performance in predicting DIH, with an MAE of 612.6 and R² of 0.52 in the training population using the PyCaret program. When BWT and WWT were added to the model in addition to SNPs, little change was observed in the model’s performance. Overall, we concluded that ML models performed better than the conventional modeling approach and genomic analysis; the CatBoost model had a 55% lower mean square error than simple linear regression (734650 vs 1637410). The results suggest that ML tools have the potential to improve the accuracy of predicting DIH compared to simple linear regression and conventional genomic analysis.
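The SNP quality-control step described above can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: it assumes genotypes are coded as 0, 1 or 2 copies of one allele with `None` for a missing call, and it interprets "allele frequency" as minor allele frequency. The function name, SNP names and genotype data are hypothetical.

```python
def snp_passes_qc(genotypes, min_maf=0.05, min_call_rate=0.95):
    """Return True if a SNP meets the call-rate and allele-frequency thresholds.

    genotypes: one entry per animal, 0/1/2 allele copies or None if uncalled.
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    if not called or call_rate < min_call_rate:
        return False
    # Each called animal carries two alleles at the locus
    allele_freq = sum(called) / (2 * len(called))
    minor_allele_freq = min(allele_freq, 1.0 - allele_freq)
    return minor_allele_freq >= min_maf

# Hypothetical genotypes for 20 animals at three SNPs, for illustration only
snps = {
    "snp_a": [0] * 10 + [1] * 8 + [2] * 2,   # frequency 0.30, fully called
    "snp_b": [0] * 19 + [1],                 # frequency 0.025, too rare
    "snp_c": [1] * 18 + [None] * 2,          # call rate 0.90, too many missing
}
kept = [name for name, calls in snps.items() if snp_passes_qc(calls)]
print(kept)
```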

APPLICATION OF MACHINE LEARNING IN ANALYSING HISTORICAL AND NON-HISTORICAL CHARACTERISTICS OF HERITAGE PRE-WAR SHOPHOUSES

PLANNING MALAYSIA JOURNAL ◽

10.21837/pm.v19i16.953 ◽

2021 ◽

Vol 19 (16) ◽

Author(s):

Nur Shahirah Ja'afar ◽

Junainah Mohamad

Keyword(s):

Machine Learning ◽

Random Forest ◽

Root Mean Square Error ◽

Current Practice ◽

Machine Learning Algorithms ◽

Mean Square ◽

Lasso Regression ◽

North East ◽

Best Fitting ◽

Fitting Model

Real estate is complex and its value is influenced by many characteristics. However, the current practice in Malaysia shows that historical characteristics have not been given primary consideration in determining the value of heritage properties. Thus, the accuracy of the values produced is questionable. This paper aims to determine whether the historical characteristics of the pre-war shophouses in North-East Penang Island, Malaysia contribute significantly to their value. Several machine learning algorithms have been developed for this purpose, namely Random Forest, Decision Tree, Lasso Regression, Ridge Regression and Linear Regression. The results show that the Random Forest Regressor with historical characteristics is the best-fitting model, with the highest R-squared (R²) and the lowest Root Mean Square Error (RMSE). This indicates that the historical characteristics of the heritage property under study contribute significantly to its value. By considering the historical characteristics, the property’s value can be better predicted.

Interpolation of Instantaneous Air Temperature Using Geographical and MODIS Derived Variables with Machine Learning Techniques

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi8090382 ◽

2019 ◽

Vol 8 (9) ◽

pp. 382 ◽

Cited By ~ 2

Author(s):

Marcos Ruiz-Álvarez ◽

Francisco Alonso-Sarria ◽

Francisco Gomariz-Castillo

Keyword(s):

Machine Learning ◽

Random Forest ◽

Linear Regression ◽

Multiple Linear Regression ◽

Air Temperature ◽

Cross Validation ◽

Daily Basis ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Support Vector

Several methods have been tried to estimate air temperature using satellite imagery. In this paper, the results of two machine learning algorithms, Support Vector Machines and Random Forest, are compared with Multiple Linear Regression and Ordinary kriging. Several geographic, remote sensing and time variables are used as predictors. The validation is carried out using two different approaches, a leave-one-out cross validation in the spatial domain and a spatio-temporal k-block cross-validation, and four different statistics on a daily basis, allowing the use of ANOVA to compare the results. The main conclusion is that Random Forest produces the best results (R² = 0.888 ± 0.026, Root mean square error = 3.01 ± 0.325 using k-block cross-validation). Regression methods (Support Vector Machine, Random Forest and Multiple Linear Regression) are calibrated with MODIS data and several predictors easily calculated from a Digital Elevation Model. The most important variables in the Random Forest model were satellite temperature, potential irradiation and cdayt, a cosine transformation of the Julian day.
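The abstract does not give the exact formula for cdayt. One common way to encode the day of the year as a cyclic predictor is shown below as a sketch; the 365-day period, the zero phase shift and the function name are all assumptions rather than the authors' definition.

```python
import math

def cosine_day_of_year(day_of_year, period_days=365.0):
    """Map a day of the year onto [-1, 1] so that late December and early
    January receive similar values (assumed form, not the paper's)."""
    return math.cos(2.0 * math.pi * day_of_year / period_days)

# Days near the turn of the year are close to +1, mid-year days close to -1
new_year = cosine_day_of_year(1)
mid_year = cosine_day_of_year(182.5)
year_end = cosine_day_of_year(365)
```

A transformation of this kind lets a regression model treat the seasonal cycle as continuous instead of seeing a jump between day 365 and day 1.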

Predicting Future Products Rate using Machine Learning Algorithms

International Journal of Intelligent Systems and Applications ◽

10.5815/ijisa.2020.05.04 ◽

2020 ◽

Vol 12 (5) ◽

pp. 41-51

Author(s):

Shaimaa Mahmoud ◽

Mahmoud Hussein ◽

Arabi Keshk

Keyword(s):

Machine Learning ◽

Random Forest ◽

Linear Regression ◽

Mean Squared Error ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Support Vector ◽

Random Forest Regression ◽

Data Set ◽

Squared Error

Opinion mining in social network data is considered one of the most important research areas because a large number of users interact with different topics on these networks. This paper discusses the problem of predicting future product ratings from users’ comments. Researchers have approached this problem using machine learning algorithms (e.g. Logistic Regression, Random Forest Regression, Support Vector Regression, Simple Linear Regression, Multiple Linear Regression, Polynomial Regression and Decision Tree). However, the accuracy of these techniques still needs to be improved. In this study, we introduce an approach for predicting future product ratings using LR, RFR, and SVR. Our data set consists of tweets and their ratings from 1 to 5. The main goal of our approach is to improve prediction accuracy over existing techniques. SVR predicts future product ratings with a Mean Squared Error (MSE) of 0.4122, the Linear Regression model with an MSE of 0.4986, and Random Forest Regression with an MSE of 0.4770. This is better than the accuracy of the existing approaches.

Machine Learning Based Power Estimation for CMOS VLSI Circuits

10.21203/rs.3.rs-723965/v1 ◽

2021 ◽

Author(s):

V. Govindaraj ◽

B. Arunadevi

Keyword(s):

Machine Learning ◽

Random Forest ◽

Mean Square Error ◽

Power Estimation ◽

Vlsi Circuits ◽

Coefficient Of Determination ◽

Accurate Estimation ◽

Mean Square ◽

Nsga Ii ◽

Cmos Vlsi

Nowadays, machine learning (ML) algorithms are receiving massive attention in most engineering applications because they are capable of modelling complex systems using historical data. Estimation of power for CMOS VLSI circuits from various circuit attributes is proposed using a passive machine learning based technique. The proposed method uses a supervised learning method that provides fast and accurate estimation of power without affecting the accuracy of the system. Power estimation using the random forest algorithm is relatively new. The power of CMOS VLSI circuits is estimated using a random forest model that is optimized and tuned with the multi-objective NSGA-II algorithm. The experimental results show that the testing error varies from 1.4 percent to 6.8 percent and that the Mean Square Error is 1.46e-06 for the random forest method when compared to BPNN. Statistical measures such as the coefficient of determination (R²) and Root Mean Square Error (RMSE) are computed, and they show that random forest is the best choice for power estimation of CMOS VLSI circuits, with a high coefficient of determination of 0.99938 and a low RMSE of 0.000116.

Machine Learning For Prognosis of Life Expectancy and Diseases

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.j9156.0881019 ◽

2019 ◽

Vol 8 (10) ◽

pp. 1765-1771

Keyword(s):

Machine Learning ◽

Random Forest ◽

Linear Regression ◽

Life Expectancy ◽

Multiple Linear Regression ◽

Machine Learning Algorithms ◽

Economic Factors ◽

Average Life Expectancy ◽

Average Percentage ◽

Hiv Aids

Longevity depends on various factors, such as the economic growth of a country and the health innovations of the region. Along with predicting life expectancy, we also determine how susceptible a particular continent is to a few chronic diseases. These factors have a robust impact on the potential life span of the population. We study the biological and economic aspects of continents and their countries to predict the life expectancy of the population and to estimate the probability of a continent having long-standing diseases such as measles and HIV/AIDS. Our research is based on the theory that life expectancy depends on, or correlates with, various factors, including health factors as well as economic factors. Two machine learning algorithms, simple linear regression and multiple linear regression, are used for predicting life expectancy across different continents, whereas the decision tree and random forest algorithms were applied to classify the likelihood of occurrence of the disease. On comparing and contrasting the algorithms, we can infer that multiple linear regression produces the most accurate estimate of the average life expectancy of the population given the current features of the continent, such as the adult mortality rate, alcohol consumption rate, infant deaths, the GDP of the country, average percentage expenditure of the population on health care and treatments, schooling rate, and other such features. In addition, we study five diseases, namely HIV/AIDS, measles, diphtheria, hepatitis B and polio. The experiment concluded that, in the majority of cases, random forest produces better classification results based on the economic factors of the combination of various countries of different continents.

Reinforced XGBoost machine learning model for sustainable intelligent agrarian applications

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-200862 ◽

2020 ◽

Vol 39 (5) ◽

pp. 7605-7620 ◽

Cited By ~ 1

Author(s):

Dhivya Elavarasan ◽

Durai Raj Vincent

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Mean Square Error ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

The Other ◽

Gradient Boosting ◽

Model Assessment ◽

Mean Square ◽

Extreme Gradient Boosting

The development of science and technical intelligence has led to an extensive amount of data being generated from various fields of agriculture. An objective therefore arises to examine the available data and integrate it with processes such as crop enhancement, yield prediction, and examination of plant infections. Machine learning has surged with powerful processing techniques to reveal new possibilities in multi-disciplinary agrarian advancements. In this paper, a novel hybrid regression algorithm, reinforced extreme gradient boosting, is proposed, which displays substantially improved performance over traditional machine learning algorithms such as artificial neural networks, deep Q-Network, gradient boosting, random forest and decision tree. Extreme gradient boosting constructs new models, which are essentially decision trees that learn from the mistakes of their predecessors by optimizing the loss function through gradient descent. The proposed hybrid model performs reinforcement learning at every node during the node splitting process of the decision tree construction. This leads to effective utilization of the samples by selecting the appropriate split attribute for enhanced performance. The model’s performance is evaluated by means of Mean Square Error, Root Mean Square Error, Mean Absolute Error, and Coefficient of Determination. To ensure a fair assessment of the results, the model assessment is performed on both the training and test datasets. The regression diagnostic plots from residuals and the results obtained show that the proposed hybrid approach performs better than the other machine learning algorithms, with reduced error measures and an improved accuracy of 94.15%. The probability density function for the proposed model also shows that it preserves the distributional characteristics of the original crop yield data more closely than the other machine learning models tested.
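The idea of trees learning from the mistakes of their predecessors can be sketched with plain gradient boosting on one-split trees (stumps). This is not the reinforced hybrid proposed in the paper, nor XGBoost itself, which adds regularization and second-order gradient information; it is a minimal illustration under squared-error loss, and all names, hyperparameters and data are hypothetical.

```python
def fit_stump(xs, targets):
    """Find the single threshold on x that minimizes squared error."""
    candidates = sorted(set(xs))[:-1]  # splitting at the max leaves one side empty
    if not candidates:
        raise ValueError("need at least two distinct x values")
    best = None
    for threshold in candidates:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted_stumps(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals left by the previous rounds."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}

def predict(model, x):
    correction = sum(predict_stump(s, x) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * correction

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical yield-like data with a step between x=3 and x=4
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = fit_boosted_stumps(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
```

The training error of the boosted model falls below that of the constant baseline because every round removes part of the remaining residual; a held-out set would still be needed to judge generalization, as the abstract notes.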
