A Review on Linear Regression Comprehensive in Machine Learning

Perhaps one of the most common and comprehensive statistical and machine learning algorithms are linear regression. Linear regression is used to find a linear relationship between one or more predictors. The linear regression has two types: simple regression and multiple regression (MLR). This paper discusses various works by different researchers on linear regression and polynomial regression and compares their performance using the best approach to optimize prediction and precision. Almost all of the articles analyzed in this review is focused on datasets; in order to determine a model's efficiency, it must be correlated with the actual values obtained for the explanatory variables.

Download Full-text

Cycling performance prediction based on cadence analysis by using multiple regression

Journal of Physics Conference Series ◽

10.1088/1742-6596/2107/1/012058 ◽

2021 ◽

Vol 2107 (1) ◽

pp. 012058

Author(s):

Sukhairi Sudin ◽

Azizi Naim Abdul Aziz ◽

Fathinul Syahir Ahmad Saad ◽

Nurul Syahirah Khalid ◽

Ismail Ishaq Ibrahim

Keyword(s):

Machine Learning ◽

Heart Rate ◽

Regression Analysis ◽

Linear Regression ◽

Linear Relationship ◽

Multiple Regression ◽

Cycling Performance ◽

Continuous Output ◽

Independent Variable ◽

Prediction Problems

Abstract This project examined the influence of the cadence, speed, heart rate and power towards the cycling performance by using Garmin Edge 1000. Any change in cadence will affect the speed, heart rate and power of the novice cyclist and the changes pattern will be observed through mobile devices installed with Garmin Connect application. Every results will be recorded for the next task which analysis the collected data by using machine learning algorithm which is Regression analysis. Regression analysis is a statistical method for modelling the connection between one or more independent variables and a dependent (target) variable. Regression analysis is required to answer these types of prediction problems in machine learning. Regression is a supervised learning technique that aids in the discovery of variable correlations and allows for the prediction of a continuous output variable based on one or more predictor variables. A total of forty days’ worth of events were captured in the dataset. Cadence act as dependent variable, (y) while speed, heart rate and power act as independent variable, (x) in prediction of the cycling performance. Simple linear regression is defined as linear regression with only one input variable (x). When there are several input variables, the linear regression is referred to as multiple linear regression. The research uses a linear regression technique to predict cycling performance based on cadence analysis. The linear regression algorithm reveals a linear relationship between a dependent (y) variable and one or more independent (y) variables, thus the name. Because linear regression reveals a linear relationship, it determines how the value of the dependent variable changes as the value of the independent variable changes. This analysis use the Mean Squared Error (MSE) expense function for Linear Regression, which is the average of squared errors between expected and real values. Value of R squared had been recorded in this project. A low R-squared value means that the independent variable is not describing any of the difference in the dependent variable-regardless of variable importance, this is letting know that the defined independent variable, although meaningful, is not responsible for much of the variance in the dependent variable’s mean. By using multiple regression, the value of R-squared in this project is acceptable because over than 0.7 and as known this project based on human behaviour and usually the R-squared value hardly to have more than 0.3 if involve human factor but in this project the R-squared is acceptable.

Download Full-text

A Daily Covid-19 Cases Prediction System using Data Mining and Machine Learning Algorithm

10.5121/csit.2021.112320 ◽

2021 ◽

Author(s):

Yiqi Jack Gao ◽

Yu Sun

Keyword(s):

Machine Learning ◽

Random Forest ◽

Hospital Admissions ◽

Polynomial Regression ◽

Learning Algorithm ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Policy Makers ◽

Diverse Range ◽

Using Data

The start of 2020 marked the beginning of the deadly COVID-19 pandemic caused by the novel SARS-COV-2 from Wuhan, China. As of the time of writing, the virus had infected over 150 million people worldwide and resulted in more than 3.5 million global deaths. Accurate future predictions made through machine learning algorithms can be very useful as a guide for hospitals and policy makers to make adequate preparations and enact effective policies to combat the pandemic. This paper carries out a two pronged approach to analyzing COVID-19. First, the model utilizes the feature significance of random forest regressor to select eight of the most significant predictors (date, new tests, weekly hospital admissions, population density, total tests, total deaths, location, and total cases) for predicting daily increases of Covid-19 cases, highlighting potential target areas in order to achieve efficient pandemic responses. Then it utilizes machine learning algorithms such as linear regression, polynomial regression, and random forest regression to make accurate predictions of daily COVID-19 cases using a combination of this diverse range of predictors and proved to be competent at generating predictions with reasonable accuracy.

Download Full-text

Eagle View: An Abstract Evaluation of Machine Learning Algorithms based on Data Properties

10.36227/techrxiv.14459361.v1 ◽

2021 ◽

Author(s):

Dhairya Vyas

Keyword(s):

Machine Learning ◽

Time Series ◽

Time Series Data ◽

Learning Algorithms ◽

Numerical Data ◽

Machine Learning Algorithms ◽

Series Data ◽

Learning Methods ◽

Machine Learning Methods ◽

Almost All

In terms of Machine Learning, the majority of the data can be grouped into four categories: numerical data, category data, time-series data, and text. We use different classifiers for different data properties, such as the Supervised; Unsupervised; and Reinforcement. Each Categorises has classifier we have tested almost all machine learning methods and make analysis among them.

Download Full-text

Automatic solution for solar cell photo-current prediction using machine learning

E3S Web of Conferences ◽

10.1051/e3sconf/202129701029 ◽

2021 ◽

Vol 297 ◽

pp. 01029

Author(s):

Mohammed Azza ◽

Jabran Daaif ◽

Adnane Aouidate ◽

El Hadi Chahid ◽

Said Belaaouad

Keyword(s):

Machine Learning ◽

Solar Cell ◽

Linear Regression ◽

Prediction Model ◽

Learning Algorithm ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Prediction Methods ◽

Lasso Regression ◽

Photo Current

In this paper, we discuss the prediction of future solar cell photo-current generated by the machine learning algorithm. For the selection of prediction methods, we compared and explored different prediction methods. Precision, MSE and MAE were used as models due to its adaptable and probabilistic methodology on model selection. This study uses machine learning algorithms as a research method that develops models for predicting solar cell photo-current. We create an electric current prediction model. In view of the models of machine learning algorithms for example, linear regression, Lasso regression, K Nearest Neighbors, decision tree and random forest, watch their order precision execution. In this point, we recommend a solar cell photocurrent prediction model for better information based on resistance assessment. These reviews show that the linear regression algorithm, given the precision, reliably outperforms alternative models in performing the solar cell photo-current prediction Iph

Download Full-text

Evaluation and calibration of a low-cost particle sensor in ambient conditions using machine-learning methods

Atmospheric Measurement Techniques ◽

10.5194/amt-13-1693-2020 ◽

2020 ◽

Vol 13 (4) ◽

pp. 1693-1707 ◽

Cited By ~ 8

Author(s):

Minxing Si ◽

Ying Xiong ◽

Shan Du ◽

Ke Du

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Random Search ◽

Low Cost ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Ambient Conditions ◽

Test Dataset ◽

Compact Size ◽

Significant Difference

Abstract. Particle sensing technology has shown great potential for monitoring particulate matter (PM) with very few temporal and spatial restrictions because of its low cost, compact size, and easy operation. However, the performance of low-cost sensors for PM monitoring in ambient conditions has not been thoroughly evaluated. Monitoring results by low-cost sensors are often questionable. In this study, a low-cost fine particle monitor (Plantower PMS 5003) was colocated with a reference instrument, the Synchronized Hybrid Ambient Real-time Particulate (SHARP) monitor, at the Calgary Varsity air monitoring station from December 2018 to April 2019. The study evaluated the performance of this low-cost PM sensor in ambient conditions and calibrated its readings using simple linear regression (SLR), multiple linear regression (MLR), and two more powerful machine-learning algorithms using random search techniques for the best model architectures. The two machine-learning algorithms are XGBoost and a feedforward neural network (NN). Field evaluation showed that the Pearson correlation (r) between the low-cost sensor and the SHARP instrument was 0.78. The Fligner and Killeen (F–K) test indicated a statistically significant difference between the variances of the PM2.5 values by the low-cost sensor and the SHARP instrument. Large overestimations by the low-cost sensor before calibration were observed in the field and were believed to be caused by the variation of ambient relative humidity. The root mean square error (RMSE) was 9.93 when comparing the low-cost sensor with the SHARP instrument. The calibration by the feedforward NN had the smallest RMSE of 3.91 in the test dataset compared to the calibrations by SLR (4.91), MLR (4.65), and XGBoost (4.19). After calibrations, the F–K test using the test dataset showed that the variances of the PM2.5 values by the NN, XGBoost, and the reference method were not statistically significantly different. From this study, we conclude that a feedforward NN is a promising method to address the poor performance of low-cost sensors for PM2.5 monitoring. In addition, the random search method for hyperparameters was demonstrated to be an efficient approach for selecting the best model structure.

Download Full-text

Using Big Data-machine learning models for diabetes prediction and flight delays analytics

Journal Of Big Data ◽

10.1186/s40537-020-00355-0 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Thérence Nibareke ◽

Jalal Laassiri

Keyword(s):

Machine Learning ◽

Big Data ◽

Linear Regression ◽

Decision Tree ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Smart Devices ◽

Learning Models ◽

Flight Delays ◽

Machine Learning Models

Abstract Introduction Nowadays large data volumes are daily generated at a high rate. Data from health system, social network, financial, government, marketing, bank transactions as well as the censors and smart devices are increasing. The tools and models have to be optimized. In this paper we applied and compared Machine Learning algorithms (Linear Regression, Naïve bayes, Decision Tree) to predict diabetes. Further more, we performed analytics on flight delays. The main contribution of this paper is to give an overview of Big Data tools and machine learning models. We highlight some metrics that allow us to choose a more accurate model. We predict diabetes disease using three machine learning models and then compared their performance. Further more we analyzed flight delay and produced a dashboard which can help managers of flight companies to have a 360° view of their flights and take strategic decisions. Case description We applied three Machine Learning algorithms for predicting diabetes and we compared the performance to see what model give the best results. We performed analytics on flights datasets to help decision making and predict flight delays. Discussion and evaluation The experiment shows that the Linear Regression, Naive Bayesian and Decision Tree give the same accuracy (0.766) but Decision Tree outperforms the two other models with the greatest score (1) and the smallest error (0). For the flight delays analytics, the model could show for example the airport that recorded the most flight delays. Conclusions Several tools and machine learning models to deal with big data analytics have been discussed in this paper. We concluded that for the same datasets, we have to carefully choose the model to use in prediction. In our future works, we will test different models in other fields (climate, banking, insurance.).

Download Full-text

Assessment of Machine Learning Algorithms for Modeling the Spatial Distribution of Bark Beetle Infestation

Forests ◽

10.3390/f12040395 ◽

2021 ◽

Vol 12 (4) ◽

pp. 395

Author(s):

Milan Koreň ◽

Rastislav Jakuš ◽

Martin Zápotocký ◽

Ivan Barka ◽

Jaroslav Holuša ◽

...

Keyword(s):

Machine Learning ◽

Spatial Distribution ◽

Discriminant Analysis ◽

Bark Beetle ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Forest Damage ◽

Spruce Forest ◽

Explanatory Variables ◽

The Mean

Machine learning algorithms (MLAs) are used to solve complex non-linear and high-dimensional problems. The objective of this study was to identify the MLA that generates an accurate spatial distribution model of bark beetle (Ips typographus L.) infestation spots. We first evaluated the performance of 2 linear (logistic regression, linear discriminant analysis), 4 non-linear (quadratic discriminant analysis, k-nearest neighbors classifier, Gaussian naive Bayes, support vector classification), and 4 decision trees-based MLAs (decision tree classifier, random forest classifier, extra trees classifier, gradient boosting classifier) for the study area (the Horní Planá region, Czech Republic) for the period 2003–2012. Each MLA was trained and tested on all subsets of the 8 explanatory variables (distance to forest damage spots from previous year, distance to spruce forest edge, potential global solar radiation, normalized difference vegetation index, spruce forest age, percentage of spruce, volume of spruce wood per hectare, stocking). The mean phi coefficient of the model generated by extra trees classifier (ETC) MLA with five explanatory variables for the period was significantly greater than that of most forest damage models generated by the other MLAs. The mean true positive rate of the best ETC-based model was 80.4%, and the mean true negative rate was 80.0%. The spatio-temporal simulations of bark beetle-infested forests based on MLAs and GIS tools will facilitate the development and testing of novel forest management strategies for preventing forest damage in general and bark beetle outbreaks in particular.

Download Full-text

URLCam: Toolkit for malicious URL analysis and modeling

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189874 ◽

2021 ◽

pp. 1-15

Author(s):

Mohammed Ayub ◽

El-Sayed M. El-Alfy

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Feature Selection ◽

State Of The Art ◽

Learning Algorithms ◽

Extraction Methods ◽

Machine Learning Algorithms ◽

The Other ◽

Imbalanced Learning ◽

Almost All

Web technology has become an indispensable part in human’s life for almost all activities. On the other hand, the trend of cyberattacks is on the rise in today’s modern Web-driven world. Therefore, effective countermeasures for the analysis and detection of malicious websites is crucial to combat the rising threats to the cyber world security. In this paper, we systematically reviewed the state-of-the-art techniques and identified a total of about 230 features of malicious websites, which are classified as internal and external features. Moreover, we developed a toolkit for the analysis and modeling of malicious websites. The toolkit has implemented several types of feature extraction methods and machine learning algorithms, which can be used to analyze and compare different approaches to detect malicious URLs. Moreover, the toolkit incorporates several other options such as feature selection and imbalanced learning with flexibility to be extended to include more functionality and generalization capabilities. Moreover, some use cases are demonstrated for different datasets.

Download Full-text

Comparative Analysis of Supervised Machine Learning Algorithms to Build a Predictive Model for Evaluating Students’ Performance

International Journal of Online and Biomedical Engineering (iJOE) ◽

10.3991/ijoe.v17i02.20025 ◽

2021 ◽

Vol 17 (02) ◽

pp. 90

Author(s):

Inssaf El Guabassi ◽

Zakaria Bousalem ◽

Rim Marah ◽

Aimad Qazdar

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Predictive Model ◽

Learning Algorithms ◽

Research Work ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Support Vector ◽

The Future ◽

Log Linear

In recent years, the world's population is increasingly demanding to predict the future with certainty, predicting the right information in any area is becoming a necessity. One of the ways to predict the future with certainty is to determine the possible future. In this sense, machine learning is a way to analyze huge datasets to make strong predictions or decisions. The main objective of this research work is to build a predictive model for evaluating students’ performance. Hence, the contributions are threefold. The first is to apply several supervised machine learning algorithms (i.e. ANCOVA, Logistic Regression, Support Vector Regression, Log-linear Regression, Decision Tree Regression, Random Forest Regression, and Partial Least Squares Regression) on our education dataset. The second purpose is to compare and evaluate algorithms used to create a predictive model based on various evaluation metrics. The last purpose is to determine the most important factors that influence the success or failure of the students. The experimental results showed that the Log-linear Regression provides a better prediction as well as the behavioral factors that influence students’ performance.

Download Full-text

Predicting Future Products Rate using Machine Learning Algorithms

International Journal of Intelligent Systems and Applications ◽

10.5815/ijisa.2020.05.04 ◽

2020 ◽

Vol 12 (5) ◽

pp. 41-51

Author(s):

Shaimaa Mahmoud ◽

◽

Mahmoud Hussein ◽

Arabi Keshk

Keyword(s):

Machine Learning ◽

Random Forest ◽

Linear Regression ◽

Mean Squared Error ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Support Vector ◽

Random Forest Regression ◽

Data Set ◽

Squared Error

Opinion mining in social networks data is considered as one of most important research areas because a large number of users interact with different topics on it. This paper discusses the problem of predicting future products rate according to users’ comments. Researchers interacted with this problem by using machine learning algorithms (e.g. Logistic Regression, Random Forest Regression, Support Vector Regression, Simple Linear Regression, Multiple Linear Regression, Polynomial Regression and Decision Tree). However, the accuracy of these techniques still needs to be improved. In this study, we introduce an approach for predicting future products rate using LR, RFR, and SVR. Our data set consists of tweets and its rate from 1:5. The main goal of our approach is improving the prediction accuracy about existing techniques. SVR can predict future product rate with a Mean Squared Error (MSE) of 0.4122, Linear Regression model predict with a Mean Squared Error of 0.4986 and Random Forest Regression can predict with a Mean Squared Error of 0.4770. This is better than the existing approaches accuracy.

Download Full-text