Arbitrage Pricing Model Based on Factor Analysis-Random Forest Regression and its application

This paper first establishes a K-factor linear model and an arbitrage pricing model following arbitrage pricing theory (APT). Ten original factors for 2001 to 2017, such as gross national product, gross industrial product and gross tertiary industry product, are collected from the Statistical Yearbook of the National Bureau of Statistics. After synthesis and simplification, three common factors are extracted to replace the ten original factors. The first common factor reflects the overall economic level of the country; the second reflects the country's inflation rate; the third reflects the country's total annual net export trade. Once the common factors are determined, their values are calculated from the original data. The annual returns of 10 stocks over 17 years are then collected, and two rounds of random forest regression yield the arbitrage pricing model. Next, using the same common factor data, a second arbitrage pricing model is obtained with the linear regression method used in previous similar papers. A comparison of pricing errors shows that the model obtained by random forest regression prices better than the model obtained by linear regression.
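For orientation, the textbook linear form of the two models named in the abstract can be written as below. The symbols ($r_{it}$, $F_{kt}$, $b_{ik}$, $\lambda_k$) are notation assumed here, not taken from the paper, and the abstract does not say exactly how random forest regression replaces the two regression steps, so this is only a sketch of the standard setup with $K = 3$ common factors.

```latex
% K-factor linear model: return of stock i in year t driven by K common factors
r_{it} = a_i + \sum_{k=1}^{K} b_{ik} F_{kt} + \varepsilon_{it},
\qquad i = 1, \dots, 10, \quad t = 2001, \dots, 2017

% APT pricing relation: expected return is linear in the factor loadings
\mathbb{E}[r_i] = \lambda_0 + \sum_{k=1}^{K} \lambda_k b_{ik}
```

In the usual linear two-pass procedure, the first regression estimates the loadings $b_{ik}$ from the time series of each stock, and the second regresses average returns on those loadings to estimate the premia $\lambda_k$.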

2016 ◽  
Author(s):  
Michael Maraun ◽  
Moritz Heene

There has come to exist within the psychometric literature a generalized belief to the effect that a determination of the level of factorial invariance that holds over a set of k populations Δj, j = 1..s, is central to ascertaining whether or not the common factor random variables ξj, j = 1..s, are equivalent. In the current manuscript, a technical examination of this belief is undertaken. The chief conclusion of the work is that, as long as technical, statistical senses of random variable equivalence are adhered to, the belief is unfounded.


2020 ◽  
Author(s):  
Peijia Liu ◽  
Dong Yang ◽  
Shaomin Li ◽  
Yutian Chong ◽  
Wentao Hu ◽  
...  

Abstract Background The use of GFR-estimating equations is critical for managing kidney disease in the clinic. However, the performance of the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation has not improved substantially in the past eight years. Here we hypothesized that the random forest regression (RF) method could outperform revised linear regression, which is used to build the CKD-EPI equation. Methods A total of 1732 participants were enrolled in this study (1333 in the development data set from Tianhe District and 399 in the external data set from Luogang District). Recursive feature elimination (RFE) was applied to the development data to select important variables and build random forest models. The same variables were then used to develop an estimated GFR equation with linear regression as a comparison. The performance of these equations was measured by bias, 30% accuracy, precision and root mean square error (RMSE). Results Of all the variables, creatinine, cystatin C, weight, body mass index (BMI), age, uric acid (UA), blood urea nitrogen (BUN), hematocrit (HCT) and apolipoprotein B (APOB) were selected by the RFE method. The overall performance of the random forest regression models exceeded that of the revised regression models based on the same variables. In the 9-variable model, the RF model was better than revised linear regression in terms of bias, precision, 30% accuracy and RMSE (0.78 vs 2.98, 16.90 vs 23.62, 0.84 vs 0.80, 16.88 vs 18.70, all P < 0.01). In the 4-variable model, the random forest regression model showed an improvement in precision and RMSE compared with the revised regression model (20.82 vs 25.25, P < 0.01; 19.08 vs 20.60, P < 0.001). Bias and 30% accuracy were also better, but the differences were not statistically significant (0.34 vs 2.07, P = 0.10; 0.8 vs 0.78, P = 0.19, respectively). Conclusions Random forest regression models perform better than revised linear regression models for GFR estimation.
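The four performance measures named in this abstract can be sketched as follows. The abstract does not define them, so the definitions here are assumptions based on common usage in GFR studies: bias as the median of (estimated minus measured), precision as the interquartile range of those differences, and 30% accuracy as the share of estimates within 30% of the measured value. The function name `gfr_metrics` is hypothetical.

```python
import math
import statistics


def gfr_metrics(measured, estimated):
    """Return bias, precision, 30% accuracy and RMSE for paired GFR values.

    Definitions are assumed conventions, not taken from the paper.
    """
    if len(measured) != len(estimated) or len(measured) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")

    diffs = [e - m for m, e in zip(measured, estimated)]

    # Bias: median difference between estimated and measured GFR (assumed).
    bias = statistics.median(diffs)

    # Precision: interquartile range of the differences (assumed).
    q1, _, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    precision = q3 - q1

    # 30% accuracy: fraction of estimates within 30% of the measured value.
    within = sum(1 for m, e in zip(measured, estimated) if abs(e - m) <= 0.30 * m)
    p30 = within / len(measured)

    # RMSE: square root of the mean squared difference.
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))

    return {"bias": bias, "precision": precision, "p30": p30, "rmse": rmse}
```

Lower bias, precision (IQR) and RMSE values and a higher 30% accuracy indicate a better equation, which matches the direction of the comparisons reported above.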


Hydrology ◽  
2021 ◽  
Vol 8 (4) ◽  
pp. 153
Author(s):  
Eva Melišová ◽  
Adam Vizina ◽  
Martin Hanel ◽  
Petr Pavlík ◽  
Petra Šuhájková

Evaporation is an important factor in the overall hydrological balance. It is usually derived as the difference between runoff, precipitation and the change in water storage in a catchment. The magnitude of actual evaporation is determined by the quantity of available water and heavily influenced by climatic and meteorological factors. Currently, evaporation can be calculated with statistical methods such as linear regression and random forest regression, or with other machine learning methods. However, in order to derive these relationships, it is necessary to have observations of evaporation from evaporation stations. In the present study, the statistical methods of linear regression and random forest regression were used to calculate evaporation, with part of the models being designed manually and the other part using stepwise regression. Observed data from 24 evaporation stations and ERA5-Land climate reanalysis data were used to create the regression models. The proposed regression formulas were tested on 33 water reservoirs. The results show that manual regression is a more appropriate method for calculating evaporation than stepwise regression, with the caveat that it is more time consuming. Linear and random forest regression differ in the variance of the data; random forest regression is better able to fit the observed data. On the other hand, the result of linear regression is simpler to interpret. The study showed that ERA5-Land reanalysis products, used with the random forest regression method, are suitable for calculating evaporation from water reservoirs under the conditions of the Czech Republic.


2019 ◽  
Vol 11 (8) ◽  
pp. 920 ◽  
Author(s):  
Syed Haleem Shah ◽  
Yoseline Angel ◽  
Rasmus Houborg ◽  
Shawkat Ali ◽  
Matthew F. McCabe

Developing rapid and non-destructive methods for chlorophyll estimation over large spatial areas is a topic of much interest, as it would provide an indirect measure of plant photosynthetic response, be useful in monitoring soil nitrogen content, and offer the capacity to assess vegetation structural and functional dynamics. Traditional methods of direct tissue analysis or the use of handheld meters are not able to capture chlorophyll variability at anything beyond point scales, so are not particularly useful for informing decisions on plant health and status at the field scale. Examining the spectral response of plants via remote sensing has shown much promise as a means to capture variations in vegetation properties, while offering a non-destructive and scalable approach to monitoring. However, determining the optimum combination of spectra or spectral indices to inform plant response remains an active area of investigation. Here, we explore the use of a machine learning approach to enhance the estimation of leaf chlorophyll (Chlt), defined as the sum of chlorophyll a and b, from spectral reflectance data. Using an ASD FieldSpec 4 Hi-Res spectroradiometer, 2700 individual leaf hyperspectral reflectance measurements were acquired from wheat plants grown across a gradient of soil salinity and nutrient levels in a greenhouse experiment. The extractable Chlt was determined from laboratory analysis of 270 collocated samples, each composed of three leaf discs. A random forest regression algorithm was trained against these data, with input predictors based upon (1) reflectance values from 2102 bands across the 400–2500 nm spectral range; and (2) 45 established vegetation indices. As a benchmark, a standard univariate regression analysis was performed to model the relationship between measured Chlt and the selected vegetation indices.
Results show that the root mean square error (RMSE) was significantly reduced when using the machine learning approach compared to standard linear regression. When exploiting the entire spectral range of individual bands as input variables, the random forest estimated Chlt with an RMSE of 5.49 µg·cm−2 and an R2 of 0.89. Model accuracy was improved when using vegetation indices as input variables, producing an RMSE ranging from 3.62 to 3.91 µg·cm−2, depending on the particular combination of indices selected. In further analysis, input predictors were ranked according to their importance level, and a step-wise reduction in the number of input features (from 45 down to 7) was performed. Implementing this resulted in no significant effect on the RMSE, and showed that much the same prediction accuracy could be obtained by a smaller subset of indices. Importantly, the random forest regression approach identified many important variables that were not good predictors according to their linear regression statistics. Overall, the research illustrates the promise in using established vegetation indices as input variables in a machine learning approach for the enhanced estimation of Chlt from hyperspectral data.


2019 ◽  
Vol 46 (5) ◽  
pp. 353-363 ◽  
Author(s):  
Chaozhe Jiang ◽  
Ping Huang ◽  
Javad Lessan ◽  
Liping Fu ◽  
Chao Wen

Accurate prediction of recoverable train delay can support train dispatchers’ decision-making in timetable rescheduling and improve service reliability. In this paper, we present the results of an effort to develop a primary delay recovery (PDR) predictor model using train operation records from the Wuhan-Guangzhou (W-G) high-speed railway. To this end, we first identified the main variables that contribute to delay, including dwell buffer time, running buffer time, magnitude of primary delay time, and individual sections’ influence. Different models are applied and calibrated to predict the PDR. The validation results on test datasets indicate that the random forest regression (RFR) model outperforms the three alternative models, namely, multiple linear regression (MLR), support vector machine (SVM), and artificial neural networks (ANN), in terms of prediction accuracy. Specifically, the evaluation results show that when the prediction tolerance is less than 1 min, the RFR model can achieve up to 80.4% prediction accuracy, while the accuracy level is 44.4%, 78.5%, and 78.5% for the MLR, SVM, and ANN models, respectively.
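The tolerance-based accuracy reported above can be illustrated with a short sketch. The abstract does not spell out the measure, so this assumes a prediction counts as correct when its absolute error is below the tolerance (here 1 min); the function name `tolerance_accuracy` and the sample values in the usage note are hypothetical.

```python
def tolerance_accuracy(actual, predicted, tolerance=1.0):
    """Share of predictions whose absolute error is below `tolerance`.

    Assumed reading of "prediction tolerance": |predicted - actual| < tolerance,
    with all values in minutes.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if abs(p - a) < tolerance)
    return hits / len(actual)
```

For example, with recovered delays of 5.0, 3.0, 8.0 and 2.0 min and predictions of 5.4, 4.5, 7.2 and 2.0 min, three of the four predictions fall within 1 min, giving an accuracy of 0.75.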


2020 ◽  
Author(s):  
Peijia Liu ◽  
Dong Yang ◽  
Shaomin Li ◽  
Yutian Chong ◽  
Ming Li ◽  
...  

Abstract Background The use of GFR-estimating equations is critical for managing kidney disease in the clinic. However, the performance of the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation has not improved substantially in the past eight years. Here we hypothesized that the random forest regression (RF) method could outperform revised linear regression, which is used to build the CKD-EPI equation. Methods A total of 1732 participants were enrolled in this study (1333 in the development data set from Tianhe District and 399 in the external data set from Luogang District). Recursive feature elimination (RFE) was applied to the development data to select important variables and build random forest models. The same variables were then used to develop an estimated GFR equation with linear regression as a comparison. The performance of these equations was measured by bias, 30% accuracy, precision and root mean square error (RMSE). Results Of all the variables, creatinine, cystatin C, weight, body mass index (BMI), age, uric acid (UA), blood urea nitrogen (BUN), hematocrit (HCT) and apolipoprotein B (APOB) were selected by the RFE method. The overall performance of the random forest regression models exceeded that of the revised regression models based on the same variables. In the 9-variable model, the RF model was better than revised linear regression in terms of bias, precision, 30% accuracy and RMSE (0.78 vs 2.98, 16.90 vs 23.62, 0.84 vs 0.80, 16.88 vs 18.70, all P < 0.01). In the 4-variable model, the random forest regression model showed an improvement in precision and RMSE compared with the revised regression model (20.82 vs 25.25, P < 0.01; 19.08 vs 20.60, P < 0.001). Bias and 30% accuracy were also better, but the differences were not statistically significant (0.34 vs 2.07, P = 0.10; 0.8 vs 0.78, P = 0.19, respectively). Conclusions Random forest regression models perform better than revised linear regression models for GFR estimation.


2020 ◽  
Vol 82 (8) ◽  
pp. 1586-1602
Author(s):  
Bahareh Beigzadeh ◽  
Mehdi Bahrami ◽  
Mohammad Javad Amiri ◽  
Mohammad Reza Mahmoudi

Abstract The use of mathematical models in water quality prediction has received growing interest recently. In this research, the potential of random forest regression (RFR), Bayesian multiple linear regression (BMLR), and multiple linear regression (MLR) was examined to predict the amount of 2,4-dichlorophenoxy acetic acid (2,4-D) eliminated by rice husk biochar from synthetic wastewater, using five input operating parameters: initial 2,4-D concentration, adsorbent dosage, pH, reaction time, and temperature. The equilibrium and kinetic adsorption data were best fitted by the Freundlich and pseudo-first-order models. The thermodynamic parameters also indicated the exothermic and spontaneous nature of adsorption. The modeling results indicated an R2 of 0.994, 0.992, and 0.945 and RMSE of 1.92, 6.17, and 2.10 for the relationship between the model-estimated and measured values of 2,4-D removal for RFR, BMLR, and MLR, respectively. Overall, RFR was more proficient than the BMLR and MLR models owing to its capability to capture the non-linear relationships between input data and their associated removal capacities. The sensitivity analysis demonstrated that the 2,4-D adsorption process is most sensitive to initial 2,4-D concentration and adsorbent dosage. Thus, the suggested model makes it possible to monitor waters continuously and more cost-effectively.
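As a reminder of what the R2 figures above measure, here is a minimal sketch of the coefficient of determination between measured and model-estimated values. It assumes the definition 1 - SS_res / SS_tot; the abstract does not state which definition was used (a squared correlation coefficient is another common choice), and the function name `r_squared` is hypothetical.

```python
def r_squared(measured, estimated):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    if len(measured) != len(estimated) or len(measured) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    mean_m = sum(measured) / len(measured)
    # Residual sum of squares: squared gaps between measured and estimated values.
    ss_res = sum((m - e) ** 2 for m, e in zip(measured, estimated))
    # Total sum of squares: spread of the measured values around their mean.
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("measured values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot
```

A value close to 1 means the estimates track the measurements closely, while RMSE reports the typical size of the remaining error in the units of the measured quantity.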


2020 ◽  
Vol 12 (5) ◽  
pp. 41-51
Author(s):  
Shaimaa Mahmoud ◽  
Mahmoud Hussein ◽  
Arabi Keshk

Opinion mining of social network data is considered one of the most important research areas, because a large number of users interact with different topics on these networks. This paper discusses the problem of predicting future product ratings from users’ comments. Researchers have approached this problem with machine learning algorithms (e.g., Logistic Regression, Random Forest Regression, Support Vector Regression, Simple Linear Regression, Multiple Linear Regression, Polynomial Regression and Decision Tree). However, the accuracy of these techniques still needs to be improved. In this study, we introduce an approach for predicting future product ratings using LR, RFR, and SVR. Our data set consists of tweets and their ratings from 1 to 5. The main goal of our approach is to improve prediction accuracy over existing techniques. SVR predicts future product ratings with a Mean Squared Error (MSE) of 0.4122, the Linear Regression model with an MSE of 0.4986, and Random Forest Regression with an MSE of 0.4770. This is better than the accuracy of existing approaches.


2019 ◽  
Vol 245 ◽  
pp. 746-753 ◽  
Author(s):  
Weiran Yuchi ◽  
Enkhjargal Gombojav ◽  
Buyantushig Boldbaatar ◽  
Jargalsaikhan Galsuren ◽  
Sarangerel Enkhmaa ◽  
...  
