data splitting
Recently Published Documents


TOTAL DOCUMENTS: 68 (FIVE YEARS 27)

H-INDEX: 14 (FIVE YEARS 2)

Water ◽  
2021 ◽  
Vol 13 (23) ◽  
pp. 3478
Author(s):  
Xiaoqiang Liu ◽  
Lifeng Wu ◽  
Fucang Zhang ◽  
Guomin Huang ◽  
Fulai Yan ◽  
...  

To improve the accuracy of estimating reference crop evapotranspiration (ET0) for the efficient management of water resources and the optimal design of irrigation scheduling, the drawback of the traditional FAO-56 Penman–Monteith method, which requires complete meteorological input variables, needs to be overcome. This study evaluates the effects of five data splitting strategies and three time lengths of input datasets on predicting ET0. Random forest (RF) and extreme gradient boosting (XGB) models coupled with a K-fold cross-validation approach were applied to accomplish this objective. The results showed that the accuracy of the RF (R2 = 0.862, RMSE = 0.528, MAE = 0.383, NSE = 0.854) was overall better than that of XGB (R2 = 0.867, RMSE = 0.517, MAE = 0.377, NSE = 0.860) across the different input parameters. Both the RF and XGB models with the combination of Tmax, Tmin, and Rs as inputs provided better accuracy on daily ET0 estimation than the corresponding models with other input combinations. Among all the data splitting strategies, S5 (with a 9:1 proportion) showed the optimal performance. Compared with the 30-year length, the estimation accuracy of the 50-year length with limited data was reduced, while the 10-year length of meteorological data improved the accuracy in southern China. Nevertheless, the 10-year data performed worst among the three time spans on the independent test. Therefore, to improve the daily ET0 prediction performance of tree-based models in humid regions of China, the random forest model with datasets of 30 years and the 9:1 data splitting strategy is recommended.
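
The error metrics quoted in this abstract (RMSE, MAE, NSE) and the 9:1 proportion can be illustrated with a minimal sketch. The function names below are hypothetical, and the ordered (non-shuffled) split is an assumption: the abstract does not say how strategy S5 assigns records to the two parts.

```python
import math

def rmse(obs, pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def split_9_to_1(records):
    """Assumed ordered 9:1 split: first 90% for training, last 10% for testing."""
    cut = (len(records) * 9) // 10
    return records[:cut], records[cut:]
```

A model that always predicts the observed mean scores an NSE of 0, which is why NSE values close to 1 (such as those reported above) indicate a good fit.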


2021 ◽  
Vol 150 (6) ◽  
pp. 4118-4127
Author(s):  
Roshan Roshankhah ◽  
Yasamin Karbalaeisadegh ◽  
Hastings Greer ◽  
Federico Mento ◽  
Gino Soldati ◽  
...  

2021 ◽  
Vol 132 ◽  
pp. 103403
Author(s):  
Rakesh Prakash ◽  
Rajesh Piplani ◽  
Jitamitra Desai

Land ◽  
2021 ◽  
Vol 10 (9) ◽  
pp. 989
Author(s):  
Minu Treesa Abraham ◽  
Neelima Satyam ◽  
Revuri Lokesh ◽  
Biswajeet Pradhan ◽  
Abdullah Alamri

Data-driven methods are widely used for the development of Landslide Susceptibility Mapping (LSM). The results of these methods are sensitive to different factors, such as the quality of input data, choice of algorithm, sampling strategies, and data splitting ratios. In this study, five different Machine Learning (ML) algorithms are used for LSM for the Wayanad district in Kerala, India, using two different sampling strategies and nine different train-to-test ratios in cross validation. The results show that Random Forest (RF), K Nearest Neighbors (KNN), and Support Vector Machine (SVM) algorithms provide better results than Naïve Bayes (NB) and Logistic Regression (LR) for the study area. NB and LR algorithms are less sensitive to the sampling strategy and data splitting, while the performance of the other three algorithms is considerably influenced by the sampling strategy. From the results, both the choice of algorithm and sampling strategy are critical in obtaining the best suited landslide susceptibility map for a region. The accuracies of KNN, RF, and SVM algorithms increased by 10.51%, 10.02%, and 4.98% with the use of polygon landslide inventory data, while for NB and LR algorithms, the performance was slightly reduced with the use of polygon data. Thus, the sampling strategy and data splitting ratio are less consequential with NB and LR algorithms, while more data points provide better results for KNN, RF, and SVM algorithms.
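
A sweep over several train-to-test ratios, as described in this abstract, could be set up roughly as follows. This is only a sketch: the nine ratios used here (10:90 through 90:10) are an assumption, since the abstract does not list them, and `split_by_percent` is a hypothetical helper.

```python
import random

def split_by_percent(n_samples, train_percent, seed=0):
    """Shuffle sample indices and cut them at the given training percentage."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    cut = n_samples * train_percent // 100
    return indices[:cut], indices[cut:]

# Assumed ratios: 10:90, 20:80, ..., 90:10 (not stated in the abstract).
train_percents = list(range(10, 100, 10))
splits = {p: split_by_percent(200, p) for p in train_percents}
```

Each entry in `splits` holds the training and test indices for one ratio, so the same model can be fitted and scored once per ratio to see how sensitive it is to the split.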


2021 ◽  
Vol 30 (4) ◽  
pp. 1-38
Author(s):  
Yingzhe Lyu ◽  
Heng Li ◽  
Mohammed Sayagh ◽  
Zhen Ming (Jack) Jiang ◽  
Ahmed E. Hassan

AIOps (Artificial Intelligence for IT Operations) leverages machine learning models to help practitioners handle the massive data produced during the operations of large-scale systems. However, due to the nature of the operation data, AIOps modeling faces several data splitting-related challenges, such as imbalanced data, data leakage, and concept drift. In this work, we study the data leakage and concept drift challenges in the context of AIOps and evaluate the impact of different modeling decisions on such challenges. Specifically, we perform a case study on two commonly studied AIOps applications: (1) predicting job failures based on trace data from a large-scale cluster environment and (2) predicting disk failures based on disk monitoring data from a large-scale cloud storage environment. First, we observe that the data leakage issue exists in AIOps solutions. Using a time-based splitting of training and validation datasets can significantly reduce such data leakage, making it more appropriate than using a random splitting in the AIOps context. Second, we show that AIOps solutions suffer from concept drift. Periodically updating AIOps models can help mitigate the impact of such concept drift, while the performance benefit and the modeling cost of increasing the update frequency depend largely on the application data and the models used. Our findings encourage future studies and practices on developing AIOps solutions to pay attention to their data-splitting decisions to handle the data leakage and concept drift challenges.
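
The contrast between time-based and random splitting can be sketched in a few lines. This is an illustrative example, not the authors' pipeline: the record layout (`timestamp`, `failed`), the 70% training share, and all function names are assumptions.

```python
import random

def time_based_split(samples, train_percent=70):
    """Train on the oldest records, validate on the newest ones."""
    ordered = sorted(samples, key=lambda s: s["timestamp"])
    cut = len(ordered) * train_percent // 100
    return ordered[:cut], ordered[cut:]

def random_split(samples, train_percent=70, seed=0):
    """Shuffle records before cutting, ignoring their order in time."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

def has_future_leak(train, validation):
    """True if any training record is newer than the oldest validation record."""
    newest_train = max(s["timestamp"] for s in train)
    oldest_validation = min(s["timestamp"] for s in validation)
    return newest_train > oldest_validation

# Hypothetical monitoring records, one per time step.
samples = [{"timestamp": t, "failed": t % 7 == 0} for t in range(100)]
```

With the time-based split, every training record precedes every validation record, so the model never sees data from the period it is evaluated on. The random split mixes the two periods, which is the kind of leakage the abstract describes.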


Technometrics ◽  
2021 ◽  
pp. 1-23
Author(s):  
V. Roshan Joseph ◽  
Akhil Vakayil

2021 ◽  
pp. 1-1
Author(s):  
Hao-Chiang Shao ◽  
Hsin-Chieh Wang ◽  
Weng-Tai Su ◽  
Chia-Wen Lin
