Review of Gregor et al: Empirical methods for the estimation of Southern Ocean CO2: Support Vector and Random Forest Regression

2017 ◽  
Author(s):  
Anonymous
2017 ◽  
Vol 14 (23) ◽  
pp. 5551-5569 ◽  
Author(s):  
Luke Gregor ◽  
Schalk Kok ◽  
Pedro M. S. Monteiro

Abstract. The Southern Ocean accounts for 40 % of oceanic CO2 uptake, but the estimates are bound by large uncertainties due to a paucity of observations. Gap-filling empirical methods have been used to good effect to approximate pCO2 from satellite-observable variables in other parts of the ocean, but many of these methods are not in agreement in the Southern Ocean. In this study we propose two additional methods that perform well in the Southern Ocean: support vector regression (SVR) and random forest regression (RFR). The methods are used to estimate ΔpCO2 in the Southern Ocean based on SOCAT v3, achieving similar trends to the SOM-FFN method by Landschützer et al. (2014). Results show that the SOM-FFN and RFR approaches have RMSEs of similar magnitude (14.84 and 16.45 µatm, where 1 atm = 101 325 Pa), whereas the SVR method has a larger RMSE (24.40 µatm). However, the larger errors for SVR and RFR are, in part, due to an increase in coastal observations from SOCAT v2 to v3, where the SOM-FFN method used v2 data. The success of both SOM-FFN and RFR depends on the ability to adapt to different modes of variability. The SOM-FFN achieves this by having independent regression models for each cluster, while this flexibility is intrinsic to the RFR method. Analysis of the estimates shows that the SVR and RFR's respective sensitivity and robustness to outliers define the outcome significantly. Further analyses of the methods were performed using a synthetic dataset to assess the following: which method (RFR or SVR) has the best performance? What is the effect of using time, latitude and longitude as proxy variables on ΔpCO2? What is the impact of the sampling bias in the SOCAT v3 dataset on the estimates? We find that while RFR is indeed better than SVR, the ensemble of the two methods outperforms either one, due to complementary strengths and weaknesses of the methods. Results also show that for the RFR and SVR implementations, it is better to include coordinates as proxy variables, as RMSE scores are lowered and the phasing of the seasonal cycle is more accurate. Lastly, we show that there is only a weak bias due to undersampling. The synthetic data provide a useful framework to test methods in regions of sparse data coverage and show potential as a useful tool to evaluate methods in future studies.
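
As a rough illustration of why an ensemble of two methods can outperform either one, the sketch below averages two sets of estimates whose errors partly cancel. The numbers are invented for illustration and are not taken from SOCAT or from the study. The equal-weight average and the names `rmse` and `ensemble_mean` are assumptions, since the abstract does not state how the ensemble was formed.

```python
import math


def rmse(estimates, observations):
    """Root-mean-square error between paired estimates and observations."""
    squared = [(e - o) ** 2 for e, o in zip(estimates, observations)]
    return math.sqrt(sum(squared) / len(squared))


def ensemble_mean(*members):
    """Point-wise, equal-weight average of several equally long series."""
    return [sum(values) / len(values) for values in zip(*members)]


# Hypothetical values in µatm, invented for illustration only.
observed = [-20.0, -5.0, 10.0, 25.0]
method_a = [-14.0, -9.0, 16.0, 21.0]  # errors: +6, -4, +6, -4
method_b = [-24.0, 1.0, 6.0, 31.0]    # errors: -4, +6, -4, +6

combined = ensemble_mean(method_a, method_b)

print("RMSE method A:", round(rmse(method_a, observed), 2))
print("RMSE method B:", round(rmse(method_b, observed), 2))
print("RMSE ensemble:", round(rmse(combined, observed), 2))
```

In this toy case the two methods err in opposite directions at each point, so the average has a much smaller RMSE than either member. With strongly correlated errors the gain would be smaller or absent.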


2017 ◽  
Author(s):  
Luke Gregor ◽  
Schalk Kok ◽  
Pedro M. S. Monteiro

Abstract. The Southern Ocean accounts for 40 % of oceanic CO2 uptake, but the estimates are bound by large uncertainties due to a paucity of observations. Gap-filling empirical methods have been used to good effect to approximate pCO2 from satellite-observable variables in other parts of the ocean, but many of these methods are not in agreement in the Southern Ocean. In this study we propose two additional methods that perform well in the Southern Ocean: Support Vector Regression (SVR) and Random Forest Regression (RFR). The methods are used to estimate ∆pCO2 in the Southern Ocean, achieving similar results to the SOM-FFN method by Landschützer et al. (2014a). The RFR was able to achieve a better RMSE (12.26 µatm) than the SVR (16.04 µatm) and SOM-FFN (12.97 µatm). To assess the efficacy of the methods and the limits of the training dataset (SOCAT v3), SVR and RFR are applied in a modelled environment. Again the RFR method outperformed the SVR by a substantial margin. However, both methods achieved higher out-of-sample than in-sample errors, indicating that the SOCAT v3 dataset is not yet fully representative of the Southern Ocean. The SVR was able to generalise better to the training dataset than the RFR, with a lower ratio between the out-of-sample and in-sample errors, but not enough to compensate for its poorer performance. The ensemble of the estimates shows that interannual variability of the Southern Ocean CO2 sink is dominated by the Polar Frontal Zone, while the Sub-Antarctic Zone is the dominant sink.
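
The comparison of out-of-sample and in-sample errors described above can be sketched as a simple ratio. The RMSE values below are invented for illustration and are not results from the study. The function name `generalisation_ratio` and the labels `method_x` and `method_y` are hypothetical.

```python
def generalisation_ratio(in_sample_rmse, out_of_sample_rmse):
    """Ratio of out-of-sample to in-sample RMSE.

    Values near 1 suggest the model performs about as well on unseen
    data as on the data it was trained on.
    """
    if in_sample_rmse <= 0:
        raise ValueError("in-sample RMSE must be positive")
    return out_of_sample_rmse / in_sample_rmse


# Hypothetical RMSE values in µatm, invented for illustration only.
scores = {
    "method_x": {"in_sample": 4.0, "out_of_sample": 12.0},
    "method_y": {"in_sample": 10.0, "out_of_sample": 15.0},
}

ratios = {
    name: generalisation_ratio(s["in_sample"], s["out_of_sample"])
    for name, s in scores.items()
}

# The method with the lowest ratio generalises best in this sense ...
best_ratio = min(ratios, key=ratios.get)
# ... but it need not be the one with the lowest out-of-sample error.
best_error = min(scores, key=lambda name: scores[name]["out_of_sample"])

print(ratios, best_ratio, best_error)
```

In this toy case `method_y` has the lower ratio while `method_x` still has the lower out-of-sample error, which mirrors the kind of trade-off the abstract describes between SVR and RFR.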


2019 ◽  
Vol 11 (11) ◽  
pp. 3222 ◽  
Author(s):  
Pascal Schirmer ◽  
Iosif Mporas

In this paper we evaluate several well-known and widely used machine learning algorithms for regression in the energy disaggregation task. Specifically, the Non-Intrusive Load Monitoring approach was considered and the K-Nearest-Neighbours, Support Vector Machines, Deep Neural Networks and Random Forest algorithms were evaluated across five datasets using seven different sets of statistical and electrical features. The experimental results demonstrated the importance of selecting both appropriate features and regression algorithms. Analysis on device level showed that linear devices can be disaggregated using statistical features, while for non-linear devices the use of electrical features significantly improves the disaggregation accuracy, as non-linear appliances have non-sinusoidal current draw and thus cannot be well parametrized only by their active power consumption. The best performance in terms of energy disaggregation accuracy was achieved by the Random Forest regression algorithm.


Complexity ◽  
2022 ◽  
Vol 2022 ◽  
pp. 1-11
Author(s):  
Marium Mehmood ◽  
Nasser Alshammari ◽  
Saad Awadh Alanazi ◽  
Fahad Ahmad

The liver is a vital organ of the human body, but detecting liver disease at an early stage is very difficult because its symptoms are hidden. Liver diseases may cause loss of energy or weakness once irregularities in the working of the liver become visible. Cancer is one of the most common diseases of the liver and also the most fatal: harmful cells grow uncontrolled inside the liver and, if diagnosed late, may cause death. Treating liver diseases at an early stage is therefore an important issue, as is designing a model to diagnose disease early. First, appropriate features that play a significant part in detecting liver cancer at an early stage should be identified. It is therefore essential to extract these features from thousands of unwanted ones, and they are mined using data mining and soft computing techniques. These techniques give optimized results that are helpful in diagnosing disease at an early stage. We use feature selection methods, namely Filter, Wrapper, and Embedded methods, to reduce the number of features in the dataset. Different regression algorithms are then applied to each of these methods individually to evaluate the result: Linear Regression, Ridge Regression, LASSO Regression, Support Vector Regression, Decision Tree Regression, Multilayer Perceptron Regression, and Random Forest Regression. We evaluate our results based on the accuracy and error rates produced by these regression algorithms. The results show that, of all the deployed regression techniques, Random Forest Regression with the Wrapper method is the best, giving the highest R2 score of 0.8923 and the lowest MSE of 0.0618.
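
For readers unfamiliar with the two metrics reported above, the following minimal sketch computes MSE and the R2 score from their standard definitions. The data are invented for illustration and the function names `mse` and `r2_score` are chosen here, not taken from the paper.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions, invented for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print("MSE:", mse(y_true, y_pred))
print("R2:", r2_score(y_true, y_pred))
```

A higher R2 and a lower MSE both indicate a closer fit, which is why the abstract reports the highest R2 alongside the lowest MSE for the same model.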


2021 ◽  
Vol 9 ◽  
Author(s):  
Qingyu Huang ◽  
Yang Yu ◽  
Yaoyi Zhang ◽  
Bo Pang ◽  
Yafeng Wang ◽  
...  

In the current nuclear reactor system analysis codes, the interfacial area concentration and void fraction are mainly obtained through empirical relations based on different flow regime maps. In the present research, the data-driven method has been proposed, using four machine learning algorithms (lasso regression, support vector regression, random forest regression and back propagation neural network) in the field of artificial intelligence to predict some important two-phase flow parameters in rectangular channels, and evaluate the performance of different models through multiple metrics. The random forest regression algorithm was found to have the strongest ability to learn from the experimental data in this study. Test results show that the prediction errors of the random forest regression model for interfacial area concentrations and void fractions are all less than 20%, which means the target parameters have been forecasted with good accuracy.


Advanced computing techniques and their application to other engineering disciplines have accelerated different aspects and phases of the engineering process. Many computer-aided methods are now widely used in civil engineering. The mathematical relationship between the compressive strength of concrete and the ratios of its components, together with other influencing factors, needs to be analyzed for different engineering needs. This paper aims to develop such a relationship and to predict the compressive strength of concrete by applying various regression techniques, namely linear regression, support vector regression, decision tree regression and random forest regression, to an assumed data set. Of the regression techniques applied, random forest regression was found to achieve considerable accuracy.


2020 ◽  
Author(s):  
Junyan Wang ◽  
Chunyan Wang ◽  
Lihong Fu ◽  
Qian Wang ◽  
Guangping Fu ◽  
...  

Abstract. In forensic science, accurate estimation of the age of a victim or suspect can help investigators narrow a search and aid in solving a crime. Aging is a complex process associated with various forms of molecular regulation at the DNA or RNA level. Recent studies have shown that circular RNAs (circRNAs) are upregulated globally during aging in multiple organisms such as mice and C. elegans because of their ability to resist degradation by exoribonucleases. In the current study, we investigated the potential of circRNAs for age prediction. We identified more than 40,000 circRNAs in the blood of thirteen unrelated healthy Chinese individuals aged 20-62 years according to their circRNA-seq profiles. Three methods were applied to select age-related circRNA candidates: false discovery rate, lasso regression, and support vector machine. The analysis uncovered a strong bias for circRNA upregulation during aging in human blood. A total of 28 circRNAs were chosen for further validation by RT-qPCR in 50 healthy unrelated subjects aged between 19 and 72 years, and finally 7 age-related circRNAs were chosen for the final age prediction models. Several algorithms, including multivariate linear regression (MLR), regression tree, bagging regression, random forest regression (RFR), and support vector regression (SVR), were compared based on root mean square error (RMSE) and mean absolute error (MAE) values. Among the five modeling methods, random forest regression (RFR) performed better than the others, with an RMSE value of 5.072 years and an MAE value of 4.065 years (R2 = 0.902). In this preliminary study, we used circRNAs for the first time as additional novel age-related biomarkers for developing forensic age estimation models. We propose that the use of circRNAs to obtain additional clues for forensic investigations and to serve as aging indicators for age prediction will become a promising field of interest.

Author summary. In forensic investigations, estimating the age of biological evidence recovered from crime scenes can provide additional information such as the chronological age or the appearance of a culprit, which can give valuable investigative leads, especially when no eyewitness is available. Hence, generating an accurate model for age prediction using body fluids such as blood, which are commonly seen at a crime scene, can be of vital importance. Various molecular changes at the DNA or RNA level have been found to be upregulated or downregulated during a person's lifetime. Although some biomarkers have been shown to be associated with aging and have been used to predict age, they have several disadvantages, such as low sensitivity, low prediction accuracy, instability, and susceptibility to diseases or immune states, which limit their applicability in the field of age estimation. Here, we used a novel biomarker, circular RNA (circRNA), to generate highly accurate age prediction models. We propose that circRNA is more suitable for degraded forensic samples because of its unique molecular structure. This preliminary research offers a new direction for exploring potential biomarkers for age prediction.


2022 ◽  
Vol 2161 (1) ◽  
pp. 012053
Author(s):  
B P Ashwini ◽  
R Sumathi ◽  
H S Sudhira

Abstract. Congested roads are a global problem, and increased usage of private vehicles is one of the main reasons for congestion. Public transit modes of travel are a sustainable and eco-friendly alternative to private vehicle usage, but attracting commuters towards public transit is a mammoth task. Commuters expect the public transit service to be reliable, and to provide a reliable service it is necessary to fine-tune the transit operations and provide well-timed, necessary information to commuters. In this context, the public transit travel time is predicted in Tumakuru, a tier-2 city of Karnataka, India. As this is one of the initial studies in the city, a performance comparison of eight Machine Learning models is conducted to identify a suitable model for travel time predictions. The comparison includes four linear models, namely Linear Regression, Ridge Regression, Least Absolute Shrinkage and Selection Operator Regression, and Support Vector Regression, and four non-linear models, namely k-Nearest Neighbors, Regression Trees, Random Forest Regression, and Gradient Boosting Regression Trees. The data logs of one month (November 2020) of the Tumakuru city service, provided by Tumakuru Smart City Limited, are used for the study. The time of day (trip start time), day of the week, and direction of travel are used for the prediction. Travel times for both upstream and downstream directions are predicted, and the results are evaluated based on the performance metrics. The results suggest that non-linear models are superior to linear models for predicting travel times, and that Random Forest Regression is a better model than the others.


2019 ◽  
Vol 46 (5) ◽  
pp. 353-363 ◽  
Author(s):  
Chaozhe Jiang ◽  
Ping Huang ◽  
Javad Lessan ◽  
Liping Fu ◽  
Chao Wen

Accurate prediction of recoverable train delay can support train dispatchers’ decision-making in timetable rescheduling and improve service reliability. In this paper, we present the results of an effort to develop a primary delay recovery (PDR) predictor model using train operation records from the Wuhan-Guangzhou (W-G) high-speed railway. To this end, we first identified the main variables that contribute to delay, including dwell buffer time, running buffer time, magnitude of primary delay time, and individual sections’ influence. Different models are applied and calibrated to predict the PDR. The validation results on test datasets indicate that the random forest regression (RFR) model outperforms the other three models, namely multiple linear regression (MLR), support vector machine (SVM), and artificial neural networks (ANN), in terms of prediction accuracy. Specifically, the evaluation results show that when the prediction tolerance is less than 1 min, the RFR model can achieve up to 80.4% prediction accuracy, while the accuracy level is 44.4%, 78.5%, and 78.5% for the MLR, SVM, and ANN models, respectively.
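
One plausible reading of "prediction accuracy within a tolerance" is the share of predictions whose absolute error is below the tolerance. The sketch below implements that reading. It is an assumption, since the abstract does not define the measure, and the data and the function name `accuracy_within_tolerance` are invented for illustration.

```python
def accuracy_within_tolerance(actual, predicted, tolerance):
    """Fraction of predictions whose absolute error is below `tolerance`."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) < tolerance)
    return hits / len(actual)


# Hypothetical delay recovery times in minutes, invented for illustration.
actual = [3.0, 5.0, 2.0, 8.0, 4.0]
predicted = [3.4, 6.5, 2.2, 7.1, 1.0]

# Three of the five predictions are within 1 min of the actual value.
share = accuracy_within_tolerance(actual, predicted, tolerance=1.0)
print(f"{share:.1%} of predictions are within 1 min")
```

Unlike RMSE, this measure ignores how large the misses are, so it is best read alongside an error magnitude metric.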

