Comparing the Forecast Performance of Advanced Statistical and Machine Learning Techniques Using Huge Big Data: Evidence from Monte Carlo Experiments

Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Faridoon Khan ◽  
Amena Urooj ◽  
Saud Ahmed Khan ◽  
Abdelaziz Alsubie ◽  
Zahra Almaspoor ◽  
...  

This research compares factor models based on principal component analysis (PCA) and partial least squares (PLS) with Autometrics, elastic smoothly clipped absolute deviation (E-SCAD), and minimax concave penalty (MCP) under different simulated schemes, such as multicollinearity, heteroscedasticity, and autocorrelation. The comparison is made across varying sample sizes and numbers of covariates. We found that under low and moderate multicollinearity, MCP often produces superior forecasts, except in the small-sample case, where E-SCAD remains better. In the case of high multicollinearity, the PLS-based factor model remained dominant, but asymptotically the prediction accuracy of E-SCAD improves significantly relative to the other methods. Under heteroscedasticity, MCP performs very well and beats the rival methods most of the time; in some large-sample settings, Autometrics provides forecasts similar to MCP's. Under low and moderate autocorrelation, MCP shows outstanding forecasting performance, except in the small-sample case, where E-SCAD produces the better forecast. Under extreme autocorrelation, E-SCAD outperforms the rival techniques in both small and medium samples, but further increases in sample size make MCP comparatively more accurate. To compare the predictive ability of all methods, we split the data into two parts (data over 1973–2007 as training data and data over 2008–2020 as testing data). Based on the root mean square error and mean absolute error, the PLS-based factor model outperforms the competing models in terms of forecasting performance.
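As a concrete illustration of the train/test evaluation described above, here is a minimal sketch (not the authors' code) that fits PCA- and PLS-based factor models on a pre-2008 training window and scores RMSE and MAE on the hold-out period. The synthetic data, variable names, and number of factors are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(48, 20))  # stand-in for 48 annual observations (1973-2020) of 20 predictors
y = X[:, :3] @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.5, size=48)

train, test = slice(0, 35), slice(35, 48)  # 1973-2007 vs. 2008-2020

# PCA factor model: extract factors, then regress the target on them.
pca = PCA(n_components=3).fit(X[train])
ols = LinearRegression().fit(pca.transform(X[train]), y[train])
pred_pca = ols.predict(pca.transform(X[test]))

# PLS factor model: factors are chosen to covary with the target.
pls = PLSRegression(n_components=3).fit(X[train], y[train])
pred_pls = pls.predict(X[test]).ravel()

for name, pred in [("PCA", pred_pca), ("PLS", pred_pls)]:
    rmse = mean_squared_error(y[test], pred) ** 0.5
    mae = mean_absolute_error(y[test], pred)
    print(f"{name}: RMSE={rmse:.3f}  MAE={mae:.3f}")
```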

Author(s):  
Roya Nasimi ◽  
Fernando Moreu ◽  
John Stormont

Rockfalls are a hazard to the safety of infrastructure as well as people. Identifying loose rocks by inspecting slopes adjacent to roadways and other infrastructure, and removing them in advance, can be an effective way to prevent unexpected rockfall incidents. This paper proposes a system for automated inspection of potential rockfalls. A robot is used to repeatedly strike or tap on the rock surface. The sound from the tapping is collected by the robot and subsequently classified with the intent of identifying rocks that are broken and prone to fall. Principal Component Analysis (PCA) of the collected acoustic data is used to recognize patterns associated with rocks of various conditions, including intact rock as well as rock with different types and locations of cracks. The PCA classification was first demonstrated on simulated sounds of different characteristics, with training and testing carried out automatically. Second, a laboratory test was conducted in which rock specimens with three different levels of discontinuity in depth and shape were tapped. A microphone mounted on the robot recorded the sound, and the data were classified into three clusters within a 2D space. A model was created from the training data to classify the remainder of the data (the test data). The performance of the method is evaluated with a confusion matrix.
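A hedged sketch of the pipeline described above: spectral features of tap recordings are projected into a 2D space with PCA, grouped into three clusters, and scored with a confusion matrix. The simulated "recordings", the feature choice, and the use of K-means for clustering are illustrative assumptions, not the authors' setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
fs, n = 8000, 1024
ring = {0: 400.0, 1: 900.0, 2: 1600.0}  # assumed dominant ring frequency per rock condition

signals, labels = [], []
for label, f0 in ring.items():
    for _ in range(30):
        t = np.arange(n) / fs
        tap = np.exp(-20 * t) * np.sin(2 * np.pi * (f0 + rng.normal(0, 30)) * t)
        spectrum = np.abs(np.fft.rfft(tap + rng.normal(0, 0.05, n)))
        signals.append(spectrum)
        labels.append(label)

# Project the spectral features into a 2D space, then form three clusters.
X2 = PCA(n_components=2).fit_transform(np.array(signals))
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)

# Cluster ids are arbitrary; the matrix still shows how cleanly the
# three rock conditions separate.
print(confusion_matrix(labels, pred))
```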


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Faridoon Khan ◽  
Amena Urooj ◽  
Kalim Ullah ◽  
Badr Alnssyan ◽  
Zahra Almaspoor

This work compares Autometrics with dual-penalization techniques, namely the minimax concave penalty (MCP) and smoothly clipped absolute deviation (SCAD), under asymmetric error distributions such as the exponential, gamma, and Fréchet, with varying sample sizes as well as numbers of predictors. Comprehensive simulations, based on a wide variety of scenarios, reveal that the methods considered perform better as the sample size increases. In the case of low multicollinearity, these methods perform well in terms of potency, but in terms of gauge the shrinkage methods collapse, and a higher gauge leads to overspecification of the models. High levels of multicollinearity adversely affect the performance of Autometrics. In contrast, the shrinkage methods are robust to high multicollinearity in terms of potency, but they tend to select a massive set of irrelevant variables. Moreover, we find that expanding the data rapidly mitigates the adverse impact of high multicollinearity on Autometrics and gradually corrects the gauge of the shrinkage methods. For the empirical application, we use gold price data spanning 1981 to 2020. To compare the forecasting performance of all selected methods, we divide the data into two parts: data over 1981–2010 are taken as training data, and those over 2011–2020 are used as testing data. All methods are trained on the training data and then assessed on the testing data. Based on root-mean-square error and mean absolute error, Autometrics remains the best at capturing the gold price trend and produces better forecasts than MCP and SCAD.
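The abstract scores selection methods by potency (the share of relevant regressors retained) and gauge (the share of irrelevant regressors retained). Below is a minimal sketch of those two metrics, assuming the true and selected variable index sets are available from any selection method (Autometrics, MCP, or SCAD); the function name and example sets are hypothetical.

```python
def potency_and_gauge(selected, relevant, n_predictors):
    """Potency: fraction of relevant variables retained.
    Gauge: fraction of irrelevant variables retained."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(n_predictors)) - relevant
    potency = len(selected & relevant) / len(relevant)
    gauge = len(selected & irrelevant) / len(irrelevant)
    return potency, gauge

# Example: 10 candidate predictors, variables 0-2 truly relevant,
# and a method that selects {0, 1, 5, 7}.
print(potency_and_gauge({0, 1, 5, 7}, {0, 1, 2}, 10))  # (0.667, 0.286) approx.
```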


2020 ◽  
Vol 24 (4) ◽  
Author(s):  
Hsiang-yu Chien ◽  
Oi-Man Kwok ◽  
Yu-Chen Yeh ◽  
Noelle Wall Sweany ◽  
Eunkyeng Baek ◽  
...  

The purpose of this study was to investigate a predictive model of online learners’ learning outcomes through machine learning. To create a model, we observed students’ motivation, learning tendencies, online learning-motivated attention, and supportive learning behaviors along with final test scores. A total of 225 college students who were taking online courses participated. Longitudinal data were collected over three semesters (T1, T2, and T3). T3 was used as training data given that it contained the largest sample size across all three data waves. To analyze the data, two approaches were applied: (a) stepwise logistic regression and (b) random forest (RF). Results showed that RF used fewer items and predicted final grades more accurately in a small sample. Furthermore, it selected four items that might potentially be used to identify at-risk learners even before they enroll in an online course.
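A minimal sketch of the two modeling approaches compared above, on synthetic data: forward-stepwise feature selection feeding a logistic regression versus a random forest whose impurity-based importances can flag a small set of predictive items. The feature counts and data are illustrative assumptions, not the study's instruments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=225, n_features=20, n_informative=4,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# (a) stepwise-style selection feeding a logistic regression
sel = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=4).fit(Xtr, ytr)
logit = LogisticRegression(max_iter=1000).fit(sel.transform(Xtr), ytr)
print("logit acc:", logit.score(sel.transform(Xte), yte))

# (b) random forest; its importances can surface a handful of predictive items
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)
print("rf acc:   ", rf.score(Xte, yte))
print("top items:", rf.feature_importances_.argsort()[-4:][::-1])
```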


Author(s):  
Rand Wilcox

There is an extensive literature dealing with inferences about the probability of success. A minor goal of this note is to point out that certain recommended methods can be unsatisfactory when the sample size is small. The main goal is to report results on the two-sample case. Extant results suggest using one of four methods. The results indicate that, when computing a 0.95 confidence interval, two of these methods can be more satisfactory when dealing with small sample sizes.
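As a concrete illustration of a two-sample interval for the difference between success probabilities (not necessarily one of the four methods the note evaluates), here is the Agresti-Caffo interval, a standard small-sample-friendly choice: add one success and one failure to each sample, then apply the Wald formula.

```python
from math import sqrt

def agresti_caffo(x1, n1, x2, n2, z=1.96):  # z = 1.96 for a 0.95 CI
    """CI for p1 - p2: adjusted proportions plus a Wald-style interval."""
    p1, p2 = (x1 + 1) / (n1 + 2), (x2 + 1) / (n2 + 2)
    se = sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    d = p1 - p2
    return d - z * se, d + z * se

# Example: 7/10 successes in group 1 vs. 3/12 in group 2
print(agresti_caffo(7, 10, 3, 12))
```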


2016 ◽  
Vol 35 (2) ◽  
pp. 173-190 ◽  
Author(s):  
S. Shahid Shaukat ◽  
Toqeer Ahmed Rao ◽  
Moazzam A. Khan

In this study, we used bootstrap simulation of a real data set to investigate the impact of sample size (N = 20, 30, 40, and 50) on the eigenvalues and eigenvectors resulting from principal component analysis (PCA). For each sample size, 100 bootstrap samples were drawn from an environmental data matrix of water quality variables (p = 22) from a small data set comprising 55 samples (stations from which water samples were collected). Because data sets in ecology and the environmental sciences are invariably small, owing to the high cost of collecting and analyzing samples, we restricted our study to relatively small sample sizes. We focused on comparing the first 6 eigenvectors and the first 10 eigenvalues. Data sets were compared using agglomerative cluster analysis with Ward's method, which does not require any stringent distributional assumptions.
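A hedged sketch of the bootstrap design described above: for each sample size N, rows of the data matrix are resampled with replacement 100 times and the leading PCA eigenvalues are recorded. The 55 × 22 shape follows the abstract; the data here are synthetic stand-ins for the water-quality matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(55, 22))  # stand-in for the 55-station x 22-variable matrix

def bootstrap_eigenvalues(X, n_boot=100, N=30, k=10):
    """Resample N rows with replacement n_boot times; keep the first k eigenvalues."""
    out = np.empty((n_boot, k))
    for b in range(n_boot):
        Xb = X[rng.integers(0, len(X), size=N)]
        eigvals = np.linalg.eigvalsh(np.cov(Xb, rowvar=False))[::-1]  # descending
        out[b] = eigvals[:k]
    return out

for N in (20, 30, 40, 50):
    ev = bootstrap_eigenvalues(data, N=N)
    print(N, ev.mean(axis=0)[:3].round(2))  # bootstrap mean of the leading eigenvalues
```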


2020 ◽  
Author(s):  
Said Ouala ◽  
Lucas Drumetz ◽  
Bertrand Chapron ◽  
Ananda Pascual ◽  
Fabrice Collard ◽  
...  

Within the geosciences community, data-driven techniques have enjoyed great success in recent years, principally owing to the success of machine learning techniques in several image and signal processing domains. However, when considering the data-driven simulation of ocean and atmospheric fields, applying these methods remains extremely challenging because the underlying dynamics usually depend on several complex hidden variables, which complicates both learning and simulation.

In this work, we aim to extract Ordinary Differential Equations (ODE) from partial observations of a system. We propose a novel neural network architecture guided by physical and mathematical considerations of the underlying dynamics. Specifically, our architecture is able to simulate the dynamics of the system from a single initial condition, even if the initial condition does not lie in the attractor spanned by the training data. We show on different case studies the effectiveness of the proposed framework, both in capturing long-term asymptotic patterns of the dynamics of the system and in addressing data assimilation issues, which relate to the short-term forecasting performance of our model.
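A heavily simplified sketch of the core idea: learn an ODE on an augmented state whose first component is the observed variable and whose remaining components are latent, training so that the integrated trajectory matches the partial observations. The architecture, sizes, integrator, and toy data are all assumptions for illustration; the authors' framework is considerably richer.

```python
import torch

torch.manual_seed(0)
# Toy partial observation: one coordinate of an oscillatory system.
obs = torch.sin(torch.linspace(0, 8 * torch.pi, 200)).unsqueeze(1)
dim, dt = 3, 0.05  # 1 observed dimension + 2 latent dimensions

# Learned vector field on the augmented state.
f = torch.nn.Sequential(torch.nn.Linear(dim, 64), torch.nn.Tanh(),
                        torch.nn.Linear(64, dim))
z0 = torch.zeros(1, dim, requires_grad=True)  # trainable initial latent state
opt = torch.optim.Adam(list(f.parameters()) + [z0], lr=1e-2)

def rk4_step(z):
    """One fourth-order Runge-Kutta step of the learned ODE."""
    k1 = f(z)
    k2 = f(z + 0.5 * dt * k1)
    k3 = f(z + 0.5 * dt * k2)
    k4 = f(z + dt * k3)
    return z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

for epoch in range(200):
    z, preds = z0, []
    for _ in range(len(obs)):
        preds.append(z[:, :1])  # only the first component is observed
        z = rk4_step(z)
    loss = torch.mean((torch.cat(preds) - obs) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
```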


Author(s):  
Saba Akmal ◽  
Hafiz Muhammad Shahzad Asif

Clustering-based sentiment analysis offers new directions for analyzing real-world opinions without human participation or the overhead of pre-tagged training data. Clustering-based techniques do not rely on linguistic information and are more convenient than other traditional machine learning techniques. Combining dimensionality reduction techniques with clustering algorithms strongly influences the computational cost and improves the performance of sentiment analysis. In this research, we applied the Principal Component Analysis technique to reduce the size of the feature set. This reduced feature set improves the results of binary K-means clustering for sentiment analysis. In our experiments, we demonstrate that the clustering system with a reduced feature set provides high-quality sentiment analysis. However, K-means clustering has its own limitations, such as hard assignment and instability of results. To overcome the limitations of the traditional K-means algorithm, we applied a soft clustering approach (the expectation-maximization algorithm), which stabilizes clustering accuracy and allows soft assignment of documents to clusters. Consequently, our experimental accuracy is 95% with a standard deviation of 0.1%, which is sufficient to apply the clustering technique in real-world applications.
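A minimal sketch of the pipeline described above, assuming documents are already vectorized (e.g., via TF-IDF): PCA for dimensionality reduction, hard binary K-means, and, as the soft EM-based alternative, a Gaussian mixture whose predict_proba gives soft assignments. The data here are synthetic stand-ins for document vectors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1, (100, 300)),   # stand-in "negative" documents
               rng.normal(0.5, 1, (100, 300))])  # stand-in "positive" documents

Xr = PCA(n_components=10).fit_transform(X)  # reduced feature set

# Hard assignment: binary K-means on the reduced features.
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xr)

# Soft assignment: EM via a Gaussian mixture, P(cluster | document).
gmm = GaussianMixture(n_components=2, random_state=0).fit(Xr)
soft = gmm.predict_proba(Xr)
print(hard[:5], soft[:2].round(2))
```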


2020 ◽  
Vol 21 ◽  
Author(s):  
Roberto Gabbiadini ◽  
Eirini Zacharopoulou ◽  
Federica Furfaro ◽  
Vincenzo Craviotto ◽  
Alessandra Zilli ◽  
...  

Background: Intestinal fibrosis and the subsequent strictures represent an important burden in inflammatory bowel disease (IBD). Detecting and evaluating the degree of fibrosis in stricturing Crohn’s disease (CD) is important for choosing the best therapeutic strategy (medical anti-inflammatory therapy, endoscopic dilation, or surgery). Ultrasound elastography (USE) is a non-invasive technique that has been proposed in the field of IBD for evaluating intestinal stiffness as a biomarker of intestinal fibrosis. Objective: The aim of this review is to discuss the ability and current role of ultrasound elastography in the assessment of intestinal fibrosis. Results and Conclusion: Data on USE in IBD come from pilot and proof-of-concept studies with small sample sizes. The first type of USE investigated was strain elastography, while shear wave elastography has been introduced more recently. Despite the methodological heterogeneity of the studies, USE has proven able to assess intestinal fibrosis in patients with stricturing CD. However, before this technique is introduced into current practice, further studies with larger sample sizes and homogeneous parameters, testing reproducibility, and identifying validated cut-off values are needed.

