Comparative analysis of methods for prediction continuous numerical features on big datasets

The object of research is the process of choosing a method for predicting continuous numerical features on big datasets. The study is important because many subject areas today need to predict performance indicators from data collected from different sources and presented in different formats, which is a big data analysis task. To solve the problem, statistical analysis methods were considered, namely multiple linear regression, decision trees and random forest. A big dataset was built without reference to a specific subject area, preprocessed, and analyzed to establish the correlation between the features. The big dataset was processed using parallel computing by means of the Dask library for Python. Although working with big data usually requires significant computing resources, this approach does not require powerful computer hardware. Prediction models were built using multiple linear regression, decision trees and random forest; the prediction results were visualized and the reliability of the constructed models was analyzed. Based on the calculated prediction error, the random forest method was found to have the greatest prediction accuracy among the considered methods. When applying this method, the prediction accuracy for a dataset of numerical features was approximately 97 %, which indicates high reliability of the constructed model. Thus, it is possible to conclude that the random forest method is suitable for solving prediction problems on large datasets: it can be used for datasets with a large number of features and is not sensitive to data scaling. The developed Python software application can be used to predict numerical features from different subject areas; the prediction results are written to a text file.
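The correlation step on a dataset too large for memory can be illustrated with a minimal sketch. It uses only the Python standard library, not Dask, and is not the authors' implementation: it shows the chunk-wise idea of accumulating running sums so that only one chunk is held in memory at a time. The names `chunked_pearson` and `make_chunks` and the synthetic data (y = 2x + 1) are assumptions made for illustration.

```python
import math
from typing import Iterable, Iterator, List, Tuple

Chunk = List[Tuple[float, float]]


def chunked_pearson(chunks: Iterable[Chunk]) -> float:
    """Pearson correlation of two features, accumulated chunk by chunk."""
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for chunk in chunks:  # only one chunk is in memory at a time
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    return cov / math.sqrt(var_x * var_y)


def make_chunks(n_rows: int, chunk_size: int) -> Iterator[Chunk]:
    """Yield hypothetical (x, y) rows with y = 2x + 1, one chunk at a time."""
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        yield [(float(i), 2.0 * i + 1.0) for i in range(start, stop)]


r = chunked_pearson(make_chunks(n_rows=10_000, chunk_size=1_000))
print(f"correlation: {r:.6f}")
```

The running-sum formula is simple but can lose precision on data with a large mean and small variance; a production pipeline would more likely rely on a library routine for this step.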

Forecasting primary delay recovery of high-speed railway using multiple linear regression, supporting vector machine, artificial neural network, and random forest regression

Canadian Journal of Civil Engineering ◽

10.1139/cjce-2017-0642 ◽

2019 ◽

Vol 46 (5) ◽

pp. 353-363 ◽

Cited By ~ 6

Author(s):

Chaozhe Jiang ◽

Ping Huang ◽

Javad Lessan ◽

Liping Fu ◽

Chao Wen

Keyword(s):

Random Forest ◽

Linear Regression ◽

Multiple Linear Regression ◽

Prediction Accuracy ◽

High Speed ◽

Support Vector ◽

Random Forest Regression ◽

High Speed Railway ◽

Buffer Time ◽

Artificial Neural

Accurate prediction of recoverable train delay can support train dispatchers’ decision-making on timetable rescheduling and improve service reliability. In this paper, we present the results of an effort to develop a primary delay recovery (PDR) predictor model using train operation records from the Wuhan-Guangzhou (W-G) high-speed railway. To this end, we first identified the main variables that contribute to delay, including dwell buffer time, running buffer time, magnitude of primary delay time, and individual sections’ influence. Different models are applied and calibrated to predict the PDR. The validation results on test datasets indicate that the random forest regression (RFR) model outperforms the three alternative models, namely multiple linear regression (MLR), support vector machine (SVM), and artificial neural networks (ANN), in terms of prediction accuracy. Specifically, the evaluation results show that when the prediction tolerance is less than 1 min, the RFR model achieves up to 80.4% prediction accuracy, while the accuracy is 44.4%, 78.5%, and 78.5% for the MLR, SVM, and ANN models, respectively.
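A tolerance-based accuracy of this kind can be sketched as follows. The abstract does not define the measure, so this assumes it is the share of predictions whose absolute error is below the tolerance. The function name `accuracy_within_tolerance` and the sample recovery times are hypothetical, not data from the paper.

```python
from typing import Sequence


def accuracy_within_tolerance(
    actual: Sequence[float],
    predicted: Sequence[float],
    tolerance_min: float,
) -> float:
    """Share of predictions whose absolute error is below the tolerance."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) < tolerance_min)
    return hits / len(actual)


# Hypothetical recovered delay times in minutes.
actual_recovery = [2.0, 0.5, 3.0, 1.5, 4.0]
predicted_recovery = [2.4, 0.5, 1.5, 1.0, 5.2]

share = accuracy_within_tolerance(actual_recovery, predicted_recovery, tolerance_min=1.0)
print(f"within 1 min: {share:.1%}")
```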

Prediction of Blended Yarn Evenness and Tensile Properties by Using Artificial Neural Network and Multiple Linear Regression

Autex Research Journal ◽

10.1515/aut-2015-0018 ◽

2016 ◽

Vol 16 (2) ◽

pp. 43-50 ◽

Cited By ~ 8

Author(s):

Samander Ali Malik ◽

Assad Farooq ◽

Thomas Gereke ◽

Chokri Cherif

Keyword(s):

Neural Networks ◽

Linear Regression ◽

Multiple Linear Regression ◽

Tensile Properties ◽

Regression Models ◽

Prediction Models ◽

Research Work ◽

Absolute Error ◽

Knowledge Domain ◽

Artificial Neural

The present research work was carried out to develop prediction models for blended ring spun yarn evenness and tensile parameters using artificial neural networks (ANNs) and multiple linear regression (MLR). Polyester/cotton blend ratio, twist multiplier, back roller hardness and break draft ratio were used as input parameters to predict yarn evenness in terms of CVm% and yarn tensile properties in terms of tenacity and elongation. Feed forward neural networks with Bayesian regularisation support were successfully trained and tested using the available experimental data. The coefficients of determination of ANN and regression models indicate that there is a strong correlation between the measured and predicted yarn characteristics with acceptable mean absolute error values. The comparative analysis of two modelling techniques shows that the ANNs perform better than the MLR models. The relative importance of input variables was determined using rank analysis through input saliency test on optimised ANN models and standardised coefficients of regression models. These models are suitable for yarn manufacturers and can be used within the investigated knowledge domain.
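The two evaluation measures named above, the coefficient of determination and the mean absolute error, can be computed with a short sketch. It uses the standard definitions, R² = 1 − SS_res / SS_tot and MAE as the mean of |measured − predicted|. The sample values are hypothetical and are not taken from the study.

```python
from typing import Sequence


def mean_absolute_error(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between measured and predicted values."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


def r_squared(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted yarn property values.
measured = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]

mae = mean_absolute_error(measured, predicted)
r2 = r_squared(measured, predicted)
print(f"MAE = {mae:.3f}, R^2 = {r2:.3f}")
```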

COMPARISON OF RANDOM FOREST AND MULTIPLE LINEAR REGRESSION TO MODEL THE MASS BALANCE OF BIOSOLIDS FROM A COMPLEX BIOSOLIDS MANAGEMENT AREA

Water Environment Research ◽

10.1002/wer.1668 ◽

2021 ◽

Author(s):

Thaís Bremm Pluth ◽

Dominic A. Brose

Keyword(s):

Random Forest ◽

Linear Regression ◽

Multiple Linear Regression ◽

Mass Balance ◽

Management Area

Comparison of Prediction Accuracy of Multiple Linear Regression, ARIMA and ARIMAX Model for Pest Incidence of Cotton with Weather Factors

Madras Agricultural Journal ◽

10.29321/maj.2018.000151 ◽

2018 ◽

Vol 105 (7-9) ◽

Author(s):

V. S. Aswathi ◽

M. R. Duraisamy

Keyword(s):

Linear Regression ◽

Multiple Linear Regression ◽

Prediction Accuracy ◽

Weather Factors ◽

Pest Incidence

Descriptive and Predictive Analytical Methods for Big Data

Web Services ◽

10.4018/978-1-5225-7501-6.ch018 ◽

2019 ◽

pp. 314-331 ◽

Cited By ~ 1

Author(s):

Sema A. Kalaian ◽

Rafa M. Kasim ◽

Nabeel R. Kasim

Keyword(s):

Big Data ◽

Standard Deviation ◽

Linear Regression ◽

Multiple Linear Regression ◽

Knowledge Discovery ◽

Data Visualization ◽

Analytical Methods ◽

Data Analytics ◽

Enterprise Performance ◽

Analytical Tools

Data analytics and modeling are powerful analytical tools for knowledge discovery: they examine and capture the complex and hidden relationships and patterns among the quantitative variables in existing massive structured Big Data in order to predict future enterprise performance. The main purpose of this chapter is to present a conceptual and practical overview of some of the basic and advanced analytical tools for analyzing structured Big Data. The chapter covers descriptive and predictive analytical methods. Descriptive analytical tools such as mean, median, mode, variance, standard deviation, and data visualization methods (e.g., histograms, line charts) are covered. Predictive analytical tools for analyzing Big Data, such as correlation and simple and multiple linear regression, are also covered in the chapter.
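The descriptive measures listed above can be illustrated with a minimal sketch based on Python's standard `statistics` module. The helper name `describe` and the sample values are hypothetical, and the variance and standard deviation shown are the sample (n − 1) versions, which is an assumption since the chapter summary does not say which variant it uses.

```python
import statistics
from typing import Dict, Sequence


def describe(values: Sequence[float]) -> Dict[str, float]:
    """Basic descriptive statistics for one quantitative variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "stdev": statistics.stdev(values),        # sample standard deviation
    }


# Hypothetical performance scores.
scores = [4.0, 8.0, 8.0, 10.0, 15.0]
summary = describe(scores)
for name, value in summary.items():
    print(f"{name}: {value}")
```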

A Detailed Study on Classification Algorithms in Big Data

Big Data Analytics for Sustainable Computing - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-9750-6.ch002 ◽

2020 ◽

pp. 30-46

Author(s):

Saranya N. ◽

Saravana Selvam

Keyword(s):

Big Data ◽

Random Forest ◽

Linear Regression ◽

Comprehensive Evaluation ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Classification Methods ◽

Computing Science ◽

Data Collections

After an era of managing data collection difficulties, these days the issue has turned into the problem of how to process these vast amounts of information. Scientists, as well as researchers, think that today, probably the most essential topic in computing science is Big Data. Big Data is used to describe the huge volume of data that could exist in any structure. This makes it difficult for standard approaches to mine the best possible data from such large data sets. Classification in Big Data is a procedure of summing up data sets dependent on various examples. There are distinctive classification frameworks which help us to classify data collections. A few methods that are discussed in the chapter are Multi-Layer Perceptron, Linear Regression, C4.5, CART, J48, SVM, ID3, Random Forest, and KNN. The target of this chapter is to provide a comprehensive evaluation of classification methods that are commonly utilized.

Multiple linear regression and random forest to predict and map soil properties using data from portable X-ray fluorescence spectrometer (pXRF)

Ciência e Agrotecnologia ◽

10.1590/1413-70542017416010317 ◽

2017 ◽

Vol 41 (6) ◽

pp. 648-664 ◽

Cited By ~ 30

Author(s):

Sérgio Henrique Godinho Silva ◽

Anita Fernanda dos Santos Teixeira ◽

Michele Duarte de Menezes ◽

Luiz Roberto Guimarães Guilherme ◽

Fatima Maria de Souza Moreira ◽

...

Keyword(s):

Random Forest ◽

Linear Regression ◽

Soil Properties ◽

Multiple Linear Regression ◽

Low Cost ◽

High Accuracy ◽

Important Variable ◽

X Ray ◽

Element Contents ◽

Fluorescence Spectrometer

Determination of soil properties helps in the correct management of soil fertility. The portable X-ray fluorescence spectrometer (pXRF) has been recently adopted to determine total chemical element contents in soils, allowing soil property inferences. However, these studies are still scarce in Brazil and other countries. The objectives of this work were to predict soil properties using pXRF data, comparing stepwise multiple linear regression (SMLR) and random forest (RF) methods, as well as mapping and validating soil properties. 120 soil samples were collected at three depths and submitted to laboratory analyses. pXRF was used on the samples and total element contents were determined. From pXRF data, SMLR and RF were used to predict soil laboratory results, reflecting soil properties, and the models were validated. The best method was used to spatialize soil properties. Using SMLR, models had high values of R² (≥0.8); however, the highest accuracy was obtained in RF modeling. Exchangeable Ca, Al, Mg, potential and effective cation exchange capacity, soil organic matter, pH, and base saturation had adequate adjustment and accurate predictions with RF. Eight out of the 10 soil properties predicted by RF using pXRF data had CaO as the most important variable helping predictions, followed by P2O5, Zn and Cr. Maps generated using RF from pXRF data had high accuracy for six soil properties, reaching R² up to 0.83. pXRF in association with RF can be used to predict soil properties with high accuracy at low cost and time, besides providing variables aiding digital soil mapping.

Bruise susceptibilities of kiwifruit as affected by impact and fruit properties

Research in Agricultural Engineering ◽

10.17221/57/2011-rae ◽

2012 ◽

Vol 58 (No. 3) ◽

pp. 107-113 ◽

Cited By ~ 19

Author(s):

E. Ahmadi

Keyword(s):

Linear Regression ◽

Multiple Linear Regression ◽

Fruit Quality ◽

Prediction Models ◽

Radius Of Curvature ◽

Regression Analyses ◽

Probability Level ◽

Absorbed Energy ◽

Bruise Susceptibility ◽

Damage Susceptibility

Kiwifruit bruise damage is a common postharvest disorder that substantially reduces fruit quality and marketability. Fruit bruises cause tissue softening and make the fruit more susceptible to undesired agents such as disease-inducing agents. Factors that affect kiwifruit bruise susceptibility, such as impact properties and fruit properties, were investigated. Two bruise prediction models were constructed for the damage susceptibility of kiwifruit (measured by absorbed energy) using multiple linear regression analyses. Kiwifruits were subjected to dynamic loading by means of a pendulum at three levels of impact. Significant effects of acoustic stiffness, temperature, the radius of curvature and some interactions on bruising were obtained at the 5% probability level.
