A novel hybrid approach utilizing principal component regression and random forest regression to bridge the period of GPS outages

2015 ◽  
Vol 166 ◽  
pp. 185-192 ◽  
Author(s):  
Srujana Adusumilli ◽  
Deepak Bhatt ◽  
Hong Wang ◽  
Vijay Devabhaktuni ◽  
Prabir Bhattacharya
2020 ◽  
Vol 12 (23) ◽  
pp. 3850
Author(s):  
Hamid Ghanbari ◽  
Olivier Jacques ◽  
Marc-Élie Adaïmé ◽  
Irene Gregory-Eaves ◽  
Dermot Antoniades

Hyperspectral imaging has recently emerged in the geosciences as a technology that provides rapid, accurate, and high-resolution information from lake sediment cores. Here we introduce a new methodology to infer particle size distribution, an insightful proxy that tracks past changes in aquatic ecosystems and their catchments, from laboratory hyperspectral images of lake sediment cores. The proposed methodology includes data preparation, spectral preprocessing and transformation, variable selection, and model fitting. We evaluated random forest regression and other commonly used statistical methods to find the best model for particle size determination. We tested the performance of combinations of spectral transformation techniques, including absorbance, continuum removal, and first and second derivatives of the reflectance and absorbance, along with different regression models including partial least squares, multiple linear regression, principal component regression, and support vector regression, and evaluated the resulting root mean square error (RMSE), R-squared, and mean relative error (MRE). Our results show that a random forest regression model built on absorbance spectra significantly outperforms all other models. The new workflow demonstrated herein represents a much-improved method for generating inferences from hyperspectral imagery, which opens many new opportunities for advancing the study of sediment archives.
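
As a rough illustration of the workflow described above (spectral transformation to absorbance, model fitting, and RMSE/R2/MRE scoring), the following Python sketch uses synthetic stand-in data; the array shapes, variable names, and hyperparameters are assumptions, not values from the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-sample reflectance spectra and measured particle size.
reflectance = rng.uniform(0.05, 0.95, size=(300, 150))
particle_size = rng.uniform(2.0, 60.0, size=300)

# Spectral transformation: absorbance = log10(1 / reflectance).
absorbance = np.log10(1.0 / reflectance)

X_train, X_test, y_train, y_test = train_test_split(
    absorbance, particle_size, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
mre = np.mean(np.abs(pred - y_test) / y_test)  # mean relative error
print(f"RMSE={rmse:.2f}  R2={r2:.2f}  MRE={mre:.2%}")
```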


2019 ◽  
Vol 59 (6) ◽  
pp. 1190 ◽  
Author(s):  
A. Bahri ◽  
S. Nawar ◽  
H. Selmi ◽  
M. Amraoui ◽  
H. Rouissi ◽  
...  

Rapid optical measurement techniques have the advantage over traditional methods of being faster and non-destructive. In this work, visible and near-infrared spectroscopy (vis-NIRS) was used to investigate differences between measured values of key milk properties (e.g. fat, protein and lactose) in 30 samples of ewes' milk according to three feed systems: faba beans, field peas and a control diet. A mobile fibre-optic vis-NIR spectrophotometer (350–2500 nm) was used to collect reflectance spectra from the milk samples. Principal component analysis was used to explore differences between milk samples according to the feed supplied, and partial least-squares regression and random forest regression were adopted to develop calibration models for the prediction of milk properties. Results of the principal component analysis showed clear separation between the three groups of milk samples according to the diet of the ewes throughout the lactation period. Milk fat, protein and lactose were predicted with good accuracy by means of partial least-squares regression (R2 = 0.70–0.83 and ratio of prediction deviation, i.e. the ratio of standard deviation to root mean square error of prediction, of 1.85–2.44). However, the best prediction results were obtained with random forest regression models (R2 = 0.86–0.90; ratio of prediction deviation = 2.73–3.26). The adoption of vis-NIRS coupled with multivariate modelling tools can be recommended for exploring differences between milk samples under different feed systems and for predicting key milk properties, particularly with the random forest regression modelling technique.
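
A minimal sketch of the calibration comparison reported above, fitting partial least-squares and random forest regressions on synthetic spectra and computing R2 and the ratio of prediction deviation (standard deviation divided by RMSEP); the data, spectral resampling, and model settings here are illustrative assumptions, not the study's measurements.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Illustrative vis-NIR reflectance spectra (350-2500 nm, resampled) and fat content (%).
spectra = rng.uniform(0.1, 0.9, size=(30, 215))
fat = rng.uniform(4.0, 8.0, size=30)

X_train, X_test, y_train, y_test = train_test_split(
    spectra, fat, test_size=0.3, random_state=1)

for name, model in [("PLS", PLSRegression(n_components=5)),
                    ("RF", RandomForestRegressor(n_estimators=300, random_state=1))]:
    model.fit(X_train, y_train)
    pred = np.ravel(model.predict(X_test))
    rmsep = np.sqrt(mean_squared_error(y_test, pred))
    rpd = np.std(y_test) / rmsep  # ratio of prediction deviation
    print(f"{name}: R2={r2_score(y_test, pred):.2f}  RPD={rpd:.2f}")
```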


Author(s):  
Kyilai Lai Khine ◽  
ThiThi Soe Nyunt

With the exponential growth of geospatial data across the globe, geospatial data analytics deserves attention for handling the voluminous geodata that arrive in various forms and at high velocity. In addition, dimensionality reduction plays a key role in high-dimensional big data sets, including spatial data sets that grow continuously not only in observations but also in features or dimensions. In this paper, predictive analytics on geospatial big data using Principal Component Regression (PCR), a traditional Multiple Linear Regression (MLR) model improved with Principal Component Analysis (PCA), is implemented on a distributed, parallel big data processing platform. The main objective of the system is to improve the predictive power of the MLR model by combining it with PCA, which removes insignificant and irrelevant variables or dimensions from that model. Moreover, the paper shows how data mining and machine learning approaches can be efficiently utilized in predictive geospatial data analytics. For experimentation, OpenStreetMap (OSM) data are used to develop a one-way road prediction for the city of Yangon, Myanmar. Experimental results show that the hybrid approach of PCA and MLR can be efficiently utilized not only for road prediction using OSM data but also for improving the traditional MLR model.
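
The PCR idea described above (PCA followed by multiple linear regression on the component scores) can be sketched as follows; the feature matrix, component count, and target are hypothetical and only illustrate the combination, not the authors' distributed OSM implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Illustrative feature matrix standing in for OSM-derived road attributes
# (e.g. lane counts, segment length, connectivity) and a numeric target.
X = rng.normal(size=(500, 20))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=500)

# Principal component regression: standardize, project onto the leading PCs,
# then fit an ordinary multiple linear regression on the scores.
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
scores = cross_val_score(pcr, X, y, cv=5, scoring="r2")
print(f"mean CV R2: {scores.mean():.3f}")
```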


2020 ◽  
Vol 98 (Supplement_4) ◽  
pp. 444-445
Author(s):  
Juliana Young ◽  
Joseph H Skarlupka ◽  
Rafael Tassinari ◽  
Amelie Fischer ◽  
Kenneth Kalscheur ◽  
...  

Abstract The rumen microbial community is the agent that allows cattle and other ruminants to process complex plant polymers into digestible fatty acids. Traditional methods to sample rumen microbes often involve labor-intensive stomach tubing or invasive surgeries to access the rumen lumen via cannula ports, thereby limiting the number of animals that can be sampled in a given study. In this study, we tested the viability of using buccal swabs as a proxy for rumen microbial contents in a time-course experiment on eight cannulated cows. Rumen contents and buccal swabs were collected at six equally spaced timepoints, with the first timepoint being 2 hours prior to feeding. Simpson diversity and Shannon evenness estimates of the microbial counts of each sample revealed that the first timepoint had the lowest diversity and highest evenness of all timepoints (Tukey HSD, P < 0.05). Principal component analysis confirmed that the buccal swab samples from the first timepoint were the most similar to paired rumen samples taken at the same times. Using a random forest classifier analysis, we estimated the Gini importance scores for individual microbial taxa as a proxy of their uniqueness to the rumen or oral environments of the cows. We identified 18 oral-only microbial taxa that are contaminants and could be removed from future comparisons using this method. Finally, we attempted to estimate the exact relative abundance of rumen microbial taxa from buccal swab samples using paired rumen-swab data in a random forest regression model. The model was found to have moderate (~38%) accuracy in cross-validation studies. Our data suggest that buccal swabs can serve as fast and suitable proxies for the rumen microbial contents of dairy cattle, but that additional factors must be measured to improve direct regression of results to those of the rumen.
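
The regression step described above can be sketched as follows, assuming paired swab/rumen relative-abundance tables; the synthetic data and taxon counts are illustrative only, and the Gini importances shown are an analogue of the importance ranking used to flag oral-only taxa, not the study's values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

# Illustrative paired data: relative abundances of taxa in buccal swabs (features)
# and the relative abundance of one rumen taxon (target), 8 cows x 6 timepoints.
swab_abundances = rng.dirichlet(np.ones(40), size=48)
rumen_target = swab_abundances[:, 0] * 0.6 + rng.normal(scale=0.01, size=48)

rf = RandomForestRegressor(n_estimators=500, random_state=3)
cv_r2 = cross_val_score(rf, swab_abundances, rumen_target, cv=5, scoring="r2")
print(f"cross-validated R2: {cv_r2.mean():.2f}")

# Impurity-based (Gini) importances rank the taxa that drive the prediction.
rf.fit(swab_abundances, rumen_target)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top taxa indices:", top)
```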


2020 ◽  
Vol 13 (2) ◽  
pp. 841-858 ◽  
Author(s):  
Simon Michel ◽  
Didier Swingedouw ◽  
Marie Chavent ◽  
Pablo Ortega ◽  
Juliette Mignot ◽  
...  

Abstract. Modes of climate variability strongly impact our climate and thus human society. Nevertheless, the statistical properties of these modes remain poorly known due to the short time frame of instrumental measurements. Reconstructing these modes further back in time using statistical learning methods applied to proxy records is useful for improving our understanding of their behaviour. For doing so, several statistical methods exist, among which principal component regression is one of the most widely used in paleoclimatology. Here, we provide the software ClimIndRec to the climate community; it is based on four regression methods (principal component regression, PCR; partial least squares, PLS; elastic net, Enet; random forest, RF) and cross-validation (CV) algorithms, and enables the systematic reconstruction of a given climate index. A prerequisite is that there are proxy records in the database that overlap in time with its observed variations. The relative efficiency of the methods can vary, according to the statistical properties of the mode and the proxy records used. Here, we assess the sensitivity to the reconstruction technique. ClimIndRec is modular as it allows different inputs like the proxy database or the regression method. As an example, it is here applied to the reconstruction of the North Atlantic Oscillation by using the PAGES 2k database. In order to identify the most reliable reconstruction among those given by the different methods, we use the modularity of ClimIndRec to investigate the sensitivity of the methodological setup to other properties such as the number and the nature of the proxy records used as predictors or the targeted reconstruction period. We obtain the best reconstruction of the North Atlantic Oscillation (NAO) using the random forest approach. It shows significant correlation with former reconstructions, but exhibits higher validation scores.
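
A condensed sketch of the kind of method comparison ClimIndRec automates, cross-validating PCR, PLS, elastic net, and random forest reconstructions of an index from proxy predictors; the synthetic proxies, component counts, and scoring are assumptions, not the package's actual configuration.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)

# Stand-ins for annually resolved proxy records (columns) overlapping an
# observed climate index (e.g. the NAO) over an instrumental calibration period.
proxies = rng.normal(size=(150, 25))
index = proxies[:, :4] @ np.array([0.5, -0.3, 0.2, 0.4]) + rng.normal(scale=0.8, size=150)

methods = {
    "PCR": make_pipeline(PCA(n_components=8), LinearRegression()),
    "PLS": PLSRegression(n_components=8),
    "Enet": ElasticNetCV(cv=5),
    "RF": RandomForestRegressor(n_estimators=300, random_state=4),
}
for name, model in methods.items():
    r2 = cross_val_score(model, proxies, index, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R2 = {r2:.2f}")
```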


2019 ◽  
Vol 26 (10) ◽  
pp. 2170-2185 ◽  
Author(s):  
Bo Xiong ◽  
Sidney Newton ◽  
Vera Li ◽  
Martin Skitmore ◽  
Bo Xia

Purpose
The purpose of this paper is to present an approach to address the overfitting and collinearity problems that frequently occur in predictive cost estimating models for construction practice. A case study, modeling the cost of preliminaries, is proposed to test the robustness of this approach.
Design/methodology/approach
A hybrid approach is developed based on the Akaike information criterion (AIC) and principal component regression (PCR). Cost information for a sample of 204 UK school building projects is collected involving elemental items, contingencies (risk) and the contractors' preliminaries. An application to estimate the cost of preliminaries for construction projects demonstrates the method and tests its effectiveness in comparison with such competing models as: alternative regression models, three artificial neural network data mining techniques, case-based reasoning and support vector machines.
Findings
The experimental results show that the AIC–PCR approach provides good predictive accuracy compared with the alternatives used, and is a promising alternative to avoid overfitting and collinearity.
Originality/value
This is the first time an approach integrating the AIC and PCR has been developed to offer an improvement on existing methods for estimating construction project preliminaries. The hybrid approach not only reduces the risk of overfitting and collinearity, but also results in better predictability compared with the commonly used stepwise regression.
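
One plausible reading of the AIC–PCR hybrid is to use the AIC to choose how many principal components enter the regression; the sketch below does this with a Gaussian-likelihood form of the AIC on synthetic data, which is an assumption for illustration rather than the paper's exact formulation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)

# Stand-ins for elemental cost items (features) and the preliminaries cost (target).
X = rng.normal(size=(204, 30))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=204)

Xs = StandardScaler().fit_transform(X)
n = len(y)
best = None
for k in range(1, 16):
    scores = PCA(n_components=k).fit_transform(Xs)
    resid = y - LinearRegression().fit(scores, y).predict(scores)
    rss = float(resid @ resid)
    aic = n * np.log(rss / n) + 2 * (k + 1)  # AIC for a Gaussian linear model
    if best is None or aic < best[0]:
        best = (aic, k)
print("AIC-selected number of components:", best[1])
```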


2019 ◽  
Vol 8 (1) ◽  
Author(s):  
Khairunnisa Khairunnisa ◽  
Rizka Pitri ◽  
Victor P Butar-Butar ◽  
Agus M Soleh

This research used CFSRv2 data as the output of a general circulation model. CFSRv2 includes variables with high mutual correlation, so this research used principal component regression (PCR) and partial least squares (PLS) to handle the multicollinearity in the CFSRv2 data. The aim is to determine the better model between PCR and PLS for estimating rainfall at the Bandung geophysical station, Bogor climatology station, Citeko meteorological station, and Jatiwangi meteorological station by comparing RMSEP and correlation values. The domain sizes used were 3×3, 4×4, 5×5, 6×6, 7×7, 8×8, 9×9, and 11×11 grid cells, located between 4° S and 9° S and between 105° E and 110° E, with a grid resolution of 0.5° × 0.5°. The PLS model was better than the PCR model for statistical downscaling in this research because it obtained lower RMSEP values and higher correlation values. The best domain and RMSEP value for the Bandung geophysical station, Bogor climatology station, Citeko meteorological station, and Jatiwangi meteorological station were 9×9 with 100.06, 6×6 with 194.3, 8×8 with 117.6, and 6×6 with 108.2, respectively.
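
A minimal sketch of the PCR-versus-PLS comparison used in this downscaling setup, scoring RMSEP and correlation on a held-out set; the synthetic grid, station series, and component counts are placeholders, not the CFSRv2 data.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)

# Stand-in for a flattened GCM grid (e.g. a 9x9 domain of precipitation cells)
# and observed monthly rainfall at one station.
grid = rng.normal(size=(240, 81))
rainfall = np.clip(grid[:, :10].sum(axis=1) * 20 + 150 + rng.normal(scale=30, size=240), 0, None)

X_tr, X_te, y_tr, y_te = train_test_split(grid, rainfall, test_size=0.25, random_state=6)

models = {"PCR": make_pipeline(PCA(n_components=6), LinearRegression()),
          "PLS": PLSRegression(n_components=6)}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = np.ravel(model.predict(X_te))
    rmsep = np.sqrt(mean_squared_error(y_te, pred))
    corr = np.corrcoef(y_te, pred)[0, 1]
    print(f"{name}: RMSEP={rmsep:.1f}  r={corr:.2f}")
```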


2007 ◽  
Vol 90 (2) ◽  
pp. 391-404 ◽  
Author(s):  
Fadia H Metwally ◽  
Yasser S El-Saharty ◽  
Mohamed Refaat ◽  
Sonia Z El-Khateeb

Abstract New selective, precise, and accurate methods are described for the determination of a ternary mixture containing drotaverine hydrochloride (I), caffeine (II), and paracetamol (III). The first method uses the first (D1) and third (D3) derivative spectrophotometry at 331 and 315 nm for the determination of (I) and (III), respectively, without interference from (II). The second method depends on the simultaneous use of the first derivative of the ratio spectra (DD1) with measurement at 312.4 nm for determination of (I) using the spectrum of 40 μg/mL (III) as a divisor or measurement at 286.4 and 304 nm after using the spectrum of 4 μg/mL (I) as a divisor for the determination of (II) and (III), respectively. In the third method, the predictive abilities of the classical least-squares, principal component regression, and partial least-squares were examined for the simultaneous determination of the ternary mixture. The last method depends on thin-layer chromatography-densitometry after separation of the mixture on silica gel plates using ethyl acetate-chloroform-methanol (16 + 3 + 1, v/v/v) as the mobile phase. The spots were scanned at 281, 272, and 248 nm for the determination of (I), (II), and (III), respectively. Regression analysis showed good correlation in the selected ranges with excellent percentage recoveries. The chemical variables affecting the analytical performance of the methodology were studied and optimized. The methods showed no significant interferences from excipients. Intraday and interday assay precision and accuracy values were within regulatory limits. The suggested procedures were checked using laboratory-prepared mixtures and were successfully applied for the analysis of their pharmaceutical preparations. The validity of the proposed methods was further assessed by applying a standard addition technique. The results obtained by applying the proposed methods were statistically analyzed and compared with those obtained by the manufacturer's method.
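
For the multivariate calibration step, a classical least-squares sketch under Beer's law is shown below; the pure-component spectra and concentrations are hypothetical and only illustrate how concentrations can be recovered from a mixture spectrum, not the paper's calibration set.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical pure-component absorptivity spectra for a three-component mixture
# (rows: wavelengths; columns: components I, II, III) under Beer's law.
wavelengths = 200
K = np.abs(rng.normal(size=(wavelengths, 3)))

true_conc = np.array([8.0, 40.0, 400.0])  # hypothetical concentrations
mixture = K @ true_conc + rng.normal(scale=0.01, size=wavelengths)

# Classical least squares: recover concentrations from the mixture spectrum.
est, *_ = np.linalg.lstsq(K, mixture, rcond=None)
print("estimated concentrations:", est.round(1))
```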


Energies ◽  
2021 ◽  
Vol 14 (7) ◽  
pp. 1809
Author(s):  
Mohammed El Amine Senoussaoui ◽  
Mostefa Brahami ◽  
Issouf Fofana

Machine learning is widely used as a panacea in many engineering applications, including the condition assessment of power transformers. Most statistics attribute the main cause of transformer failure to insulation degradation. Thus, a new, simple, and effective machine-learning approach was proposed to monitor the condition of transformer oils based on several aging indicators. The proposed approach was used to compare the performance of two machine-learning classifiers: the J48 decision tree and random forest. The service-aged transformer oils were classified into four groups: oils that can be maintained in service, oils that should be reconditioned or filtered, oils that should be reclaimed, and oils that must be discarded. Of the two algorithms, random forest exhibited better performance and high accuracy with only a small amount of data. Good performance was achieved not only through the proposed algorithm but also through the data preprocessing approach. Before feeding the classification model, the available data were transformed using the simple k-means method. Subsequently, the obtained data were filtered through correlation-based feature selection (CFsSubset). The resulting features were then retransformed by principal component analysis and passed once more through the CFsSubset filter. The transformation and filtration of the data improved the classification performance of the adopted algorithms, especially random forest. Another advantage of the proposed method is the reduction in the amount of data required for the condition assessment of transformer oils, which is valuable for transformer condition monitoring.
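
A rough approximation of the preprocessing-plus-classification chain described above, using scikit-learn analogues (k-means distances as a data transformation, a univariate filter standing in for the CFsSubset selector, then PCA and a random forest); the synthetic data and every parameter are assumptions, not the Weka setup used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(8)

# Illustrative aging indicators (e.g. acidity, interfacial tension, dissipation factor)
# for oil samples labelled with one of the four maintenance classes.
X = rng.normal(size=(120, 6))
y = rng.integers(0, 4, size=120)

# Approximate chain: k-means transform (distances to centroids), a univariate
# filter in place of CFsSubset, PCA, and a random forest classifier.
pipeline = make_pipeline(
    KMeans(n_clusters=4, n_init=10, random_state=8),
    SelectKBest(f_classif, k=3),
    PCA(n_components=3),
    RandomForestClassifier(n_estimators=300, random_state=8),
)
acc = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"cross-validated accuracy: {acc.mean():.2f}")
```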

