Imputation of Missing Values in Economic and Financial Time Series Data Using Five Principal Component Analysis (PCA) Approaches

Author(s):  
Chisimkwuo John ◽  
Emmanuel J. Ekpenyong ◽  
Charles C. Nworu

This study assessed five approaches for imputing missing values. The evaluated methods are Singular Value Decomposition imputation (svdPCA), Bayesian imputation (bPCA), Probabilistic imputation (pPCA), Non-Linear Iterative Partial Least Squares imputation (nipalsPCA) and Local Least Squares imputation (llsPCA). Missing data at rates of 5%, 10%, 15% and 20% were created in R under a missing completely at random (MCAR) assumption using five variables, Net Foreign Assets (NFA), Credit to Core Private Sector (CCP), Reserve Money (RM), Narrow Money (M1) and Private Sector Demand Deposits (PSDD), from the Nigerian quarterly monetary aggregates dataset for 1981 to 2019. The data were collected from the Central Bank of Nigeria statistical bulletin. The five imputation methods were used to estimate the artificially generated missing values. The performance of the PCA imputation approaches was evaluated using the Mean Forecast Error (MFE), Root Mean Squared Error (RMSE) and Normalized Root Mean Squared Error (NRMSE) criteria. The results suggest that the bPCA, llsPCA and pPCA methods performed better than the other imputation methods, with bPCA being the more appropriate method and llsPCA the best method, as it appears to be more stable than the others across the proportions of missingness.
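The evaluation design described in this abstract (mask known values under MCAR, impute them, score the estimates against the held-back truth) can be sketched as follows. This is a minimal illustration, not the study's R code: the data are synthetic, `svd_impute` is a hypothetical helper implementing a plain iterative low-rank SVD fill rather than any of the five named routines, and normalising the NRMSE by the standard deviation of the true values is an assumption, since conventions differ.

```python
import numpy as np


def svd_impute(x_missing, rank=2, n_iter=50):
    """Iterative low-rank SVD imputation of the NaN entries in a 2-D array."""
    missing = np.isnan(x_missing)
    # Start from column means, then repeatedly refit a low-rank approximation.
    filled = np.where(missing, np.nanmean(x_missing, axis=0), x_missing)
    for _ in range(n_iter):
        center = filled.mean(axis=0)
        u, s, vt = np.linalg.svd(filled - center, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank] + center
        filled[missing] = low_rank[missing]  # observed entries are never changed
    return filled


def mfe(truth, estimate):
    """Mean forecast error (signed bias)."""
    return float(np.mean(truth - estimate))


def rmse(truth, estimate):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((truth - estimate) ** 2)))


def nrmse(truth, estimate):
    """RMSE divided by the standard deviation of the truth (assumed convention)."""
    return rmse(truth, estimate) / float(np.std(truth))


rng = np.random.default_rng(0)
# Synthetic stand-in: 156 rows (quarters) by 5 correlated columns, NOT the CBN data.
factors = rng.normal(size=(156, 2))
loadings = rng.normal(size=(2, 5))
data = factors @ loadings + 0.1 * rng.normal(size=(156, 5))

results = {}
for rate in (0.05, 0.10, 0.15, 0.20):
    mask = rng.random(data.shape) < rate  # MCAR: every cell has the same chance
    x_missing = data.copy()
    x_missing[mask] = np.nan
    imputed = svd_impute(x_missing)
    truth, estimate = data[mask], imputed[mask]
    results[rate] = (mfe(truth, estimate), rmse(truth, estimate), nrmse(truth, estimate))
    print(f"rate={rate:.2f}  MFE={results[rate][0]:.4f}  "
          f"RMSE={results[rate][1]:.4f}  NRMSE={results[rate][2]:.4f}")
```

Scoring only the masked cells keeps the comparison fair, because the observed cells are reproduced exactly by construction.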

PLoS ONE ◽  
2022 ◽  
Vol 17 (1) ◽  
pp. e0262131
Author(s):  
Adil Aslam Mir ◽  
Kimberlee Jane Kearfott ◽  
Fatih Vehbi Çelebi ◽  
Muhammad Rafique

A new methodology, imputation by feature importance (IBFI), is studied that can be applied to any machine learning method to efficiently fill in any missing or irregularly sampled data. It applies to data missing completely at random (MCAR), missing not at random (MNAR), and missing at random (MAR). IBFI utilizes the feature importance and iteratively imputes missing values using any base learning algorithm. For this work, IBFI is tested on soil radon gas concentration (SRGC) data. XGBoost is used as the learning algorithm and missing data are simulated using R for different missingness scenarios. IBFI is based on the physically meaningful assumption that SRGC depends upon environmental parameters such as temperature and relative humidity. This assumption leads to a model obtained from the complete multivariate series where the controls are available by taking the attribute of interest as a response variable. IBFI is tested against other frequently used imputation methods, namely mean, median, mode, predictive mean matching (PMM), and hot-deck procedures. The performance of the different imputation methods was assessed using root mean squared error (RMSE), mean squared log error (MSLE), mean absolute percentage error (MAPE), percent bias (PB), and mean squared error (MSE) statistics. The imputation process requires more attention when multiple variables are missing in different samples, resulting in challenges to machine learning methods because some controls are missing. IBFI appears to have an advantage in such circumstances. IBFI was tested on radon time series (RTS) data collected from 1 March 2017 to 11 May 2018, a period that included four seismic activities.
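The error statistics named in this abstract can be written down compactly. The sketch below uses common textbook definitions, which are assumptions here: the abstract does not give formulas, and percent bias in particular is defined with different signs and denominators across the literature. The example values are invented for illustration.

```python
import numpy as np


def mse(obs, pred):
    """Mean squared error; RMSE is its square root."""
    return float(np.mean((pred - obs) ** 2))


def msle(obs, pred):
    """Mean squared log error, using log(1 + x) so that zeros are allowed."""
    return float(np.mean((np.log1p(pred) - np.log1p(obs)) ** 2))


def mape(obs, pred):
    """Mean absolute percentage error, in percent (obs must be non-zero)."""
    return float(100.0 * np.mean(np.abs((obs - pred) / obs)))


def percent_bias(obs, pred):
    """Percent bias, in percent: positive when predictions overshoot in total."""
    return float(100.0 * np.sum(pred - obs) / np.sum(obs))


# Invented example values, not radon measurements.
obs = np.array([100.0, 200.0, 300.0, 400.0])
pred = np.array([110.0, 190.0, 300.0, 420.0])

print("MSE :", mse(obs, pred))
print("MSLE:", msle(obs, pred))
print("MAPE:", mape(obs, pred))
print("PB  :", percent_bias(obs, pred))
```

MAPE and PB are scale-free, which makes them easier to compare across variables than MSE, while MSLE damps the influence of large values.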


2012 ◽  
Vol 61 (2) ◽  
pp. 277-290 ◽  
Author(s):  
Ádám Csorba ◽  
Vince Láng ◽  
László Fenyvesi ◽  
Erika Michéli

Nowadays there is a growing demand for the development and application of technologies and methods that enable rapid, cost-effective and environmentally friendly soil data collection and evaluation. Reflectance spectroscopy, which is based on reflectance measurements in the visible (VIS) and near-infrared (NIR) range (350–2500 nm) of the electromagnetic spectrum, meets these demands. Because the reflectance spectrum recorded from soils is very rich in information, and many soil constituents have a characteristic spectral "fingerprint" in the examined range, a large number of key soil parameters can be determined simultaneously from a single curve. In this paper we present the first steps of a methodological development, built on the fundamentals of reflectance spectroscopy, aimed at determining the composition of soils. In our work we created and tested predictive models, based on multivariate mathematical-statistical methods (partial least squares regression, PLSR), for estimating the organic carbon and CaCO3 content of soils. In testing the models, we found that the procedure gave high R2 values for both soil parameters [R2(organic carbon) = 0.815; R2(CaCO3) = 0.907]. The root mean squared error (RMSE), which indicates the accuracy of the estimate, can be considered moderate for both parameters [RMSE(organic carbon) = 0.467; RMSE(CaCO3) = 3.508] and could be improved considerably by standardizing reflectance measurement protocols. Based on our investigations, we concluded that the combined application of reflectance spectroscopy and multivariate chemometric procedures can provide a rapid and cost-effective method for data collection and evaluation.


Stats ◽  
2019 ◽  
Vol 2 (4) ◽  
pp. 457-467 ◽  
Author(s):  
Hossein Hassani ◽  
Mahdi Kalantari ◽  
Zara Ghodsi

In all fields of quantitative research, analysing data with missing values is an excruciating challenge. It should be no surprise that given the fragmentary nature of fossil records, the presence of missing values in geographical databases is unavoidable. As in such studies ignoring missing values may result in biased estimations or invalid conclusions, adopting a reliable imputation method should be regarded as an essential consideration. In this study, the performance of singular spectrum analysis (SSA) based on the L1 norm was evaluated on the compiled δ13C data from East Africa soil carbonates, which is a world targeted historical geology data set. Results were compared with ten traditionally well-known imputation methods, showing that L1-SSA performs well in keeping the variability of the time series and providing estimations which are less affected by extreme values, suggesting the method introduced here deserves further consideration in practice.


2015 ◽  
Vol 78 (4) ◽  
pp. 668-674 ◽  
Author(s):  
Matthew Eady ◽  
Bosoon Park ◽  
Sun Choi

This study was designed to evaluate hyperspectral microscope images for early and rapid detection of Salmonella serotypes Enteritidis, Heidelberg, Infantis, Kentucky, and Typhimurium at incubation times of 6, 8, 10, 12, and 24 h. Images were collected by an acousto-optical tunable filter hyperspectral microscope imaging system with a metal halide light source measuring 89 contiguous wavelengths every 4 nm between 450 and 800 nm. Pearson correlation values were calculated for incubation times of 8, 10, and 12 h and compared with data for 24 h to evaluate the change in spectral signatures from bacterial cells over time. Regions of interest were analyzed at 30% of the pixels in an average cell size. Spectral data were preprocessed by applying a global data transformation algorithm and then subjected to principal component analysis (PCA). The Mahalanobis distance was calculated from PCA score plots for analyzing serotype cluster separation. Partial least-squares regression was applied for calibration and validation of the model, and soft independent modeling of class analogy was utilized to classify serotype clusters in the training set. Pearson correlation values indicate very similar spectral patterns for reduced incubation times ranging from 0.9869 to 0.9990. PCA score plots indicated cluster separation at all incubation times, with incubation time Mahalanobis distances of 2.146 to 27.071. Partial least-squares regression had a maximum root mean squared error of calibration of 0.0025 and a root mean squared error of validation of 0.0030. Soft independent modeling of class analogy correctly classified values at 8 h (98.32%), 10 h (96.67%), 12 h (88.33%), and 24 h (98.67%) with the optimal number of principal components (four or five). The results of this study suggest that Salmonella serotypes can be classified by applying a PCA to hyperspectral microscope imaging data from samples after only 8 h of incubation.


2019 ◽  
Vol 40 (1) ◽  
pp. 127-135 ◽  
Author(s):  
Khemissi Houari ◽  
Tarik Hartani ◽  
Boualem Remini ◽  
Abdelouhab Lefkir ◽  
Leila Abda ◽  
...  

In this paper, the capacity of an Adaptive-Network-Based Fuzzy Inference System (ANFIS) for predicting the salinity of the Tafna River is investigated. Time series data of daily liquid flow and saline concentrations from the gauging station of Pierre du Chat (160801) were used for training, validating and testing the hybrid model. Different methods were used to test the accuracy of the results, i.e. the coefficient of determination (R2), the Nash–Sutcliffe efficiency coefficient (E), the root of the mean squared error (RSR) and graphic techniques. The model produced satisfactory results and showed very good agreement between the predicted and observed data, with R2 equal to 88% for training, 78.01% for validation and 80.00% for testing; E equal to 85.84% for training, 82.51% for validation and 78.17% for testing; and RSR equal to 2% for training, 10% for validation and 49% for testing.
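The three accuracy statistics reported in this abstract can be illustrated with a short sketch. The definitions are assumptions, since the abstract gives no formulas: R2 is taken as the squared Pearson correlation, E as the standard Nash–Sutcliffe form, and RSR as the ratio of the RMSE to the standard deviation of the observations, which is its common hydrological usage. The sample values are invented.

```python
import numpy as np


def nash_sutcliffe(obs, sim):
    """E = 1 - SSE / (sum of squared deviations of obs from its mean)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))


def rsr(obs, sim):
    """RMSE divided by the standard deviation of the observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.sqrt(np.sum((obs - sim) ** 2)) / np.sqrt(np.sum((obs - obs.mean()) ** 2)))


def r_squared(obs, sim):
    """Squared Pearson correlation between observed and simulated values."""
    return float(np.corrcoef(obs, sim)[0, 1] ** 2)


# Invented example values, not Tafna River data.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
simulated = [1.1, 1.9, 3.2, 3.8, 5.0]

print("E   =", nash_sutcliffe(observed, simulated))
print("RSR =", rsr(observed, simulated))
print("R2  =", r_squared(observed, simulated))
```

Under these definitions RSR squared equals 1 minus E, so the two statistics carry the same information on different scales, whereas R2 ignores systematic bias between the two series.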


Methodology ◽  
2021 ◽  
Vol 17 (3) ◽  
pp. 189-204
Author(s):  
Cailey E. Fitzgerald ◽  
Ryne Estabrook ◽  
Daniel P. Martin ◽  
Andreas M. Brandmaier ◽  
Timo von Oertzen

Missing data are ubiquitous in psychological research. They may come about as an unwanted result of coding or computer error, participants' non-response or absence, or missing values may be intentional, as in planned missing designs. We discuss the effects of missing data on χ²-based goodness-of-fit indices in Structural Equation Modeling (SEM), specifically on the Root Mean Squared Error of Approximation (RMSEA). We use simulations to show that naive implementations of the RMSEA have a downward bias in the presence of missing data and, thus, overestimate model goodness-of-fit. Unfortunately, many state-of-the-art software packages report the biased form of RMSEA. As a consequence, the scientific community may have been accepting a much larger fraction of models with non-acceptable model fit. We propose a bias-correction for the RMSEA based on information-theoretic considerations that take into account the expected misfit of a person with fully observed data. The corrected RMSEA is asymptotically independent of the proportion of missing data for misspecified models. Importantly, results of the corrected RMSEA computation are identical to naive RMSEA if there are no missing data.
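For orientation, the conventional (naive) RMSEA that this abstract refers to can be computed from a model's χ² statistic, its degrees of freedom and the sample size. The sketch below shows that textbook formula only; it is not the bias-corrected version proposed in the paper, and the use of N − 1 in the denominator is an assumption, since some software uses N instead. The function name and input values are illustrative.

```python
import math


def naive_rmsea(chi2, df, n):
    """Textbook RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# Illustrative values: chi-square of 30 on 20 degrees of freedom, 201 participants.
print(naive_rmsea(30.0, 20, 201))
# A chi-square below its degrees of freedom is truncated to an RMSEA of zero.
print(naive_rmsea(15.0, 20, 201))
```

Because n enters the denominator directly, the choice of which sample size to plug in when some cases are only partly observed affects the index.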


The main objective of this paper is to analyze the characteristics and features that affect fluctuations in cryptocurrency prices and to develop an interactive cryptocurrency chatbot that provides predictive analysis of cryptocurrency prices. The chatbot is developed using the IBM Watson Assistant service. The predictive analytics is performed by analyzing datasets of various cryptocurrencies and applying appropriate time series models. Time series forecasting is used to predict future prices. A predictive model such as ARIMA is used to calculate the mean squared error of the fitted model. Facebook's prophet package, which implements a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly and weekly seasonality, is further used to predict cryptocurrency prices.

