Multiple linear regression and random forest to predict and map soil properties using data from portable X-ray fluorescence spectrometer (pXRF)

ABSTRACT Determination of soil properties helps in the correct management of soil fertility. The portable X-ray fluorescence spectrometer (pXRF) has been recently adopted to determine total chemical element contents in soils, allowing soil property inferences. However, these studies are still scarce in Brazil and other countries. The objectives of this work were to predict soil properties using pXRF data, comparing stepwise multiple linear regression (SMLR) and random forest (RF) methods, as well as to map and validate soil properties. A total of 120 soil samples were collected at three depths and submitted to laboratory analyses. The samples were scanned with pXRF to determine total element contents. From pXRF data, SMLR and RF were used to predict soil laboratory results, reflecting soil properties, and the models were validated. The best method was used to spatialize soil properties. With SMLR, models had high R² values (≥0.8); however, the highest accuracy was obtained with RF modeling. Exchangeable Ca, Al, Mg, potential and effective cation exchange capacity, soil organic matter, pH, and base saturation had adequate adjustment and accurate predictions with RF. Eight out of the 10 soil properties predicted by RF using pXRF data had CaO as the most important variable helping predictions, followed by P2O5, Zn and Cr. Maps generated using RF from pXRF data had high accuracy for six soil properties, reaching R² up to 0.83. pXRF in association with RF can be used to predict soil properties with high accuracy at low cost and time, besides providing variables aiding digital soil mapping.

Evaluating and Improving the Reliability of Gas-Phase Sensor System Calibrations Across New Locations for Ambient Measurements and Personal Exposure Monitoring

10.5194/amt-2019-30 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sharad Vikram ◽

Ashley Collier-Oxandale ◽

Michael Ostertag ◽

Massimiliano Menarini ◽

Camron Chermak ◽

...

Keyword(s):

Neural Network ◽

Random Forest ◽

Linear Regression ◽

Multiple Linear Regression ◽

Environmental Conditions ◽

Low Cost ◽

Calibration Model ◽

Architecture Model ◽

Pollutant Concentrations ◽

Sensor Package

Abstract. Advances in ambient environmental monitoring technologies are enabling concerned communities and citizens to collect data to better understand their local environment and potential exposures. These mobile, low-cost tools make it possible to collect data with increased temporal and spatial resolution providing data on a large scale with unprecedented levels of detail. This type of data has the potential to empower people to make personal decisions about their exposure and support the development of local strategies for reducing pollution and improving health outcomes. However, calibration of these low-cost instruments has been a challenge. Often, a sensor package is calibrated via field calibration. This involves colocating the sensor package with a high-quality reference instrument for an extended period and then applying machine learning or other model fitting technique such as multiple-linear regression to develop a calibration model for converting raw sensor signals to pollutant concentrations. Although this method helps to correct for the effects of ambient conditions (e.g., temperature) and cross-sensitivities with non-target pollutants, there is a growing body of evidence that calibration models can overfit to a given location or set of environmental conditions on account of the incidental correlation between pollutant levels and environmental conditions, including diurnal cycles. As a result, a sensor package trained at a field site may provide less reliable data when moved, or transferred, to a different location. This is a potential concern for applications seeking to perform monitoring away from regulatory monitoring sites, such as personal mobile monitoring or high-resolution monitoring of a neighborhood. We performed experiments confirming that transferability is indeed a problem and show that it can be improved by collecting data from multiple regulatory sites and building a calibration model that leverages data from a more diverse dataset. 
We deployed three sensor packages to each of three sites with reference monitors (nine packages total) and then rotated the sensor packages through the sites over time. Two sites were in San Diego, CA, with a third outside of Bakersfield, CA, offering varying environmental conditions, general air quality composition, and pollutant concentrations. When compared to prior single-site calibration, the multi-site approach exhibits better model transferability for a range of modeling approaches. Our experiments also reveal that random forest is especially prone to overfitting, and confirm prior results that transfer is a significant source of both bias and standard error. Bias dominated in our experiments, suggesting that transferability might be easily increased by detecting and correcting for bias. Also, given that many monitoring applications involve the deployment of many sensor packages based on the same sensing technology, there is an opportunity to leverage the availability of multiple sensors at multiple sites during calibration. We contribute a new neural network architecture model termed split-NN that splits the model into two stages, in which the first stage corrects for sensor-to-sensor variation and the second stage uses the combined data of all the sensors to build a model for a single sensor package. The split-NN modeling approach outperforms multiple linear regression, traditional 2- and 4-layer neural networks, and random forest models.

Evaluating and improving the reliability of gas-phase sensor system calibrations across new locations for ambient measurements and personal exposure monitoring

Atmospheric Measurement Techniques ◽

10.5194/amt-12-4211-2019 ◽

2019 ◽

Vol 12 (8) ◽

pp. 4211-4239 ◽

Cited By ~ 5

Author(s):

Sharad Vikram ◽

Ashley Collier-Oxandale ◽

Michael H. Ostertag ◽

Massimiliano Menarini ◽

Camron Chermak ◽

...

Keyword(s):

Random Forest ◽

Linear Regression ◽

Multiple Linear Regression ◽

Environmental Conditions ◽

Low Cost ◽

Calibration Model ◽

Data Set ◽

Architecture Model ◽

Pollutant Concentrations ◽

Sensor Package

Abstract. Advances in ambient environmental monitoring technologies are enabling concerned communities and citizens to collect data to better understand their local environment and potential exposures. These mobile, low-cost tools make it possible to collect data with increased temporal and spatial resolution, providing data on a large scale with unprecedented levels of detail. This type of data has the potential to empower people to make personal decisions about their exposure and support the development of local strategies for reducing pollution and improving health outcomes. However, calibration of these low-cost instruments has been a challenge. Often, a sensor package is calibrated via field calibration. This involves colocating the sensor package with a high-quality reference instrument for an extended period and then applying machine learning or other model fitting technique such as multiple linear regression to develop a calibration model for converting raw sensor signals to pollutant concentrations. Although this method helps to correct for the effects of ambient conditions (e.g., temperature) and cross sensitivities with nontarget pollutants, there is a growing body of evidence that calibration models can overfit to a given location or set of environmental conditions on account of the incidental correlation between pollutant levels and environmental conditions, including diurnal cycles. As a result, a sensor package trained at a field site may provide less reliable data when moved, or transferred, to a different location. This is a potential concern for applications seeking to perform monitoring away from regulatory monitoring sites, such as personal mobile monitoring or high-resolution monitoring of a neighborhood. We performed experiments confirming that transferability is indeed a problem and show that it can be improved by collecting data from multiple regulatory sites and building a calibration model that leverages data from a more diverse data set. 
We deployed three sensor packages to each of three sites with reference monitors (nine packages total) and then rotated the sensor packages through the sites over time. Two sites were in San Diego, CA, with a third outside of Bakersfield, CA, offering varying environmental conditions, general air quality composition, and pollutant concentrations. When compared to prior single-site calibration, the multisite approach exhibits better model transferability for a range of modeling approaches. Our experiments also reveal that random forest is especially prone to overfitting and confirm prior results that transfer is a significant source of both bias and standard error. Linear regression, on the other hand, although it exhibits relatively high error, does not degrade much in transfer. Bias dominated in our experiments, suggesting that transferability might be easily increased by detecting and correcting for bias. Also, given that many monitoring applications involve the deployment of many sensor packages based on the same sensing technology, there is an opportunity to leverage the availability of multiple sensors at multiple sites during calibration to lower the cost of training and better tolerate transfer. We contribute a new neural network architecture model termed split-NN that splits the model into two stages, in which the first stage corrects for sensor-to-sensor variation and the second stage uses the combined data of all the sensors to build a model for a single sensor package. The split-NN modeling approach outperforms multiple linear regression, traditional two- and four-layer neural networks, and random forest models. Depending on the training configuration, compared to random forest the split-NN method reduced error by 0 %–11 % for NO2 and by 6 %–13 % for O3.
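The abstract's distinction between bias and standard error in transfer can be illustrated with a short sketch. It is not the authors' method: the readings are made-up values in ppb, the function names are hypothetical, and it assumes some reference data are available at the new site to estimate the bias. It relies only on the standard identity that mean squared error equals squared bias plus the variance of the errors.

```python
from statistics import fmean, pstdev

def bias_and_spread(predicted, reference):
    """Return (mean error, population standard deviation of the errors)."""
    errors = [p - r for p, r in zip(predicted, reference)]
    return fmean(errors), pstdev(errors)

def mse(predicted, reference):
    """Mean squared error between predictions and reference values."""
    return fmean([(p - r) ** 2 for p, r in zip(predicted, reference)])

# Hypothetical readings after moving a calibrated sensor to a new site (ppb).
reference = [20.0, 25.0, 30.0, 35.0, 40.0]
predicted = [26.0, 30.0, 37.0, 40.0, 47.0]

bias, spread = bias_and_spread(predicted, reference)

# Subtracting the estimated bias leaves only the spread component of the error.
corrected = [p - bias for p in predicted]

print(f"bias={bias:.2f} ppb, spread={spread:.2f} ppb")
print(f"MSE before={mse(predicted, reference):.2f}, after={mse(corrected, reference):.2f}")
```

When bias dominates, as in this toy data, removing a constant offset eliminates most of the error, which is the intuition behind the abstract's remark on detecting and correcting for bias.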

Calibrations of Low-Cost Air Pollution Monitoring Sensors for CO, NO2, O3, and SO2

Sensors ◽

10.3390/s21010256 ◽

2021 ◽

Vol 21 (1) ◽

pp. 256

Author(s):

Pengfei Han ◽

Han Mei ◽

Di Liu ◽

Ning Zeng ◽

Xiao Tang ◽

...

Keyword(s):

Random Forest ◽

Linear Regression ◽

Short Term Memory ◽

Hot Spot ◽

Low Cost ◽

Field Evaluation ◽

Pollution Monitoring ◽

Coefficient Of Determination ◽

Air Pollution Monitoring ◽

National Monitoring

Pollutant gases, such as CO, NO2, O3, and SO2, affect human health, and low-cost sensors are an important complement to regulatory-grade instruments in pollutant monitoring. Previous studies focused on one or several species, while comprehensive assessments of multiple sensors remain limited. We conducted a 12-month field evaluation of four Alphasense sensors in Beijing and used single linear regression (SLR), multiple linear regression (MLR), random forest regressor (RFR), and neural network (long short-term memory (LSTM)) methods to calibrate and validate the measurements with nearby reference measurements from national monitoring stations. In terms of the coefficient of determination (R2) and root mean square error (RMSE), sensor performance ranked CO > O3 > NO2 > SO2. The MLR did not increase the R2 after considering the temperature and relative humidity influences compared with the SLR (with R2 remaining at approximately 0.6 for O3 and 0.4 for NO2). However, the RFR and LSTM models significantly increased the O3, NO2, and SO2 performances, with the R2 increasing from 0.3–0.5 to >0.7 for O3 and NO2, and the RMSE decreasing from 20.4 to 13.2 ppb for NO2. For the SLR, there were relatively larger biases, while the LSTMs maintained a close mean relative bias of approximately zero (e.g., <5% for O3 and NO2), indicating that these sensors combined with the LSTMs are suitable for hot spot detection. We highlight that the performance of LSTM is better than that of random forest and linear methods. This study assessed four electrochemical air quality sensors and different calibration models, and the methodology and results can benefit assessments of other low-cost sensors.
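For readers unfamiliar with the evaluation metrics named in this abstract, the following is a minimal sketch of RMSE and a mean relative bias. The abstract does not give its exact formula for mean relative bias, so the definition used here (difference of means divided by the reference mean) is an assumption, and the sample values are invented for illustration.

```python
from math import sqrt
from statistics import fmean

def rmse(estimated, reference):
    """Root mean square error between calibrated estimates and reference values."""
    return sqrt(fmean([(e - r) ** 2 for e, r in zip(estimated, reference)]))

def mean_relative_bias(estimated, reference):
    """Assumed definition: (mean estimate - mean reference) / mean reference."""
    return (fmean(estimated) - fmean(reference)) / fmean(reference)

# Hypothetical calibrated sensor output and reference concentrations (ppb).
reference = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 19.0, 33.0, 40.0]

print(f"RMSE = {rmse(estimated, reference):.2f} ppb")
print(f"mean relative bias = {mean_relative_bias(estimated, reference):.1%}")
```

The two metrics answer different questions: RMSE summarises the typical size of individual errors, while the mean relative bias shows whether the estimates are systematically high or low.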

CAMS-Net: The Clean Air Monitoring and Solutions Network

10.5194/egusphere-egu21-13912 ◽

2021 ◽

Author(s):

Daniel Westervelt ◽

Celeste McFarlane ◽

Faye McNeill ◽

R (Subu) Subramanian ◽

Mike Giordano ◽

...

Keyword(s):

Air Pollution ◽

Air Quality ◽

Linear Regression ◽

Multiple Linear Regression ◽

Low Cost ◽

Air Monitoring ◽

Quality Data ◽

Calibration Methods ◽

The World ◽

Clean Air

There is a severe lack of air pollution data around the world. This includes large portions of low- and middle-income countries (LMICs), as well as rural areas of wealthier nations, as monitors tend to be located in large metropolises. Low-cost sensors (LCS) for measuring air pollution and identifying sources offer a possible path forward to remedy the lack of data, though significant knowledge gaps and caveats remain regarding the accurate application and interpretation of such devices. The Clean Air Monitoring and Solutions Network (CAMS-Net) establishes an international network of networks that unites scientists, decision-makers, city administrators, citizen groups, the private sector, and other local stakeholders in co-developing new methods and best practices for real-time air quality data collection, data sharing, and solutions for air quality improvements. CAMS-Net brings together at least 32 multidisciplinary member networks from North America, Europe, Africa, and India. The project establishes a mechanism for international collaboration, builds technical capacity, shares knowledge, and trains the next generation of air quality practitioners and advocates, including domestic and international graduate students and postdoctoral researchers. Here we present some preliminary research accelerated through the CAMS-Net project. Specifically, we present LCS calibration methodology for several co-locations in LMICs (Accra, Ghana; Kampala, Uganda; Nairobi, Kenya; Addis Ababa, Ethiopia; and Kolkata, India), in which reference BAM-1020 PM2.5 monitors were placed side-by-side with LCS. We demonstrate that both simple multiple linear regression calibration methods for bias-correcting LCS and more complex machine learning methods can reduce bias in LCS to close to zero, while increasing correlation. For example, in Kampala, raw PurpleAir PM2.5 data are strongly correlated with the BAM-1020 PM2.5 (r2 = 0.88), but have a mean bias of approximately 12 μg m-3. Two calibration models, multiple linear regression and a random forest approach, decrease mean bias from 12 μg m-3 to -1.84 μg m-3 or less and improve the r2 from 0.88 to 0.96. We find similar performance in several other regions of the world. Location-specific calibration of low-cost sensors is necessary in order to obtain useful data, since sensor performance is closely tied to environmental conditions such as relative humidity. This work is a first step towards developing a database of region-specific correction factors for low-cost sensors, which are exploding in popularity globally and have the potential to close the air pollution data gap, especially in resource-limited countries.
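A rough sketch of regression-based bias correction at a co-location site follows. It simplifies the multiple linear regression mentioned in the abstract to a single predictor (the raw sensor reading), omits covariates such as relative humidity, and uses invented numbers rather than any CAMS-Net data. The helper name `fit_line` is hypothetical.

```python
from statistics import fmean

def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_mean, y_mean = fmean(x), fmean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical co-located PM2.5 readings (ug m-3): low-cost sensor vs. reference.
raw = [30.0, 42.0, 55.0, 61.0, 80.0]
reference = [20.0, 29.0, 41.0, 47.0, 65.0]

slope, intercept = fit_line(raw, reference)
calibrated = [slope * v + intercept for v in raw]

bias_before = fmean(raw) - fmean(reference)
bias_after = fmean(calibrated) - fmean(reference)
print(f"mean bias before: {bias_before:.2f}, after: {bias_after:.2e}")
```

On the data used for fitting, a least-squares line with an intercept always has zero mean residual, so the in-sample bias vanishes by construction. Bias measured on held-out data or at another site is the more informative figure.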

COMPARISON OF RANDOM FOREST AND MULTIPLE LINEAR REGRESSION TO MODEL THE MASS BALANCE OF BIOSOLIDS FROM A COMPLEX BIOSOLIDS MANAGEMENT AREA

Water Environment Research ◽

10.1002/wer.1668 ◽

2021 ◽

Author(s):

Thaís Bremm Pluth ◽

Dominic A. Brose

Keyword(s):

Random Forest ◽

Linear Regression ◽

Multiple Linear Regression ◽

Mass Balance ◽

Management Area

Performance of NO, NO2 low cost sensors and three calibration approaches within a real world application

Atmospheric Measurement Techniques ◽

10.5194/amt-11-3717-2018 ◽

2018 ◽

Vol 11 (6) ◽

pp. 3717-3735 ◽

Cited By ~ 17

Author(s):

Alessandro Bigi ◽

Michael Mueller ◽

Stuart K. Grange ◽

Grazia Ghermandi ◽

Christoph Hueglin

Keyword(s):

Random Forest ◽

Linear Regression ◽

Support Vector Regression ◽

Low Cost ◽

Multivariate Linear Regression ◽

Pollution Level ◽

Support Vector ◽

Non Linear ◽

Linear Algorithms ◽

Urban Sites

Abstract. Low cost sensors for measuring atmospheric pollutants are experiencing an increase in popularity worldwide among practitioners, academia and environmental agencies, and a large amount of data from these devices is being delivered to the public. Nevertheless, their behaviour, performance and reliability are not yet fully investigated and understood. In the present study we investigate the medium term performance of a set of NO and NO2 electrochemical sensors in Switzerland using three different regression algorithms within a field calibration approach. In order to mimic a realistic application of these devices, the sensors were initially co-located at a rural regulatory monitoring site for a 4-month calibration period, and subsequently deployed for 4 months at two distant regulatory urban sites in traffic and urban background conditions, where the performance of the calibration algorithms was explored. The applied algorithms were Multivariate Linear Regression, Support Vector Regression and Random Forest; these were tested, along with the sensors, in terms of generalisability, selectivity, drift, uncertainty, bias, noise and suitability for spatial mapping of intra-urban pollution gradients with hourly resolution. Results from the deployment at the urban sites show a better performance of the non-linear algorithms (Support Vector Regression and Random Forest) achieving RMSE < 5 ppb, R2 between 0.74 and 0.95 and MAE between 2 and 4 ppb. The combined use of both NO and NO2 sensor output in the estimate of each pollutant showed some contribution by NO sensor to NO2 estimate and vice-versa. All algorithms exhibited a drift ranging between 5 and 10 ppb for Random Forest and 15 ppb for Multivariate Linear Regression at the end of the deployment. The lowest concentration correctly estimated, with a 25 % relative expanded uncertainty, was ca. 15–20 ppb and was provided by the non-linear algorithms.
As an assessment of the suitability of the tested sensors for a targeted application, the probability of resolving hourly concentration differences in cities was investigated. It was found that NO concentration differences of 5–10 ppb (8–10 for NO2) can reliably be detected (90 % confidence), depending on the air pollution level. The findings of this study, although derived from a specific sensor type and sensor model, are based on a flexible methodology and have extensive potential for exploring the performance of other low cost sensors that differ in their target pollutant and sensing technology.

A Multiple Linear Regression Based High-Accuracy Error Prediction Algorithm for Reversible Data Hiding

Digital Forensics and Watermarking - Lecture Notes in Computer Science ◽

10.1007/978-3-030-11389-6_15 ◽

2019 ◽

pp. 195-205

Author(s):

Bin Ma ◽

Xiaoyu Wang ◽

Bing Li ◽

Yun-Qing Shi

Keyword(s):

Linear Regression ◽

Multiple Linear Regression ◽

Data Hiding ◽

Reversible Data Hiding ◽

High Accuracy ◽

Prediction Algorithm ◽

Error Prediction

IoT-based Estimation System for Microcystis aeruginosa Cyanobacteria in Laguna de Bay using an Arduino-controlled Spectrophotometric Device

E3S Web of Conferences ◽

10.1051/e3sconf/202132504007 ◽

2021 ◽

Vol 325 ◽

pp. 04007

Author(s):

Lawrence D. Alejandrino ◽

Jessica Joy D. Jocson ◽

Micah Romina R. Mirarza ◽

Ericson D. Dimaunahan ◽

Ramon G Garcia ◽

...

Keyword(s):

Support Vector Machine ◽

Linear Regression ◽

Potable Water ◽

Low Cost ◽

High Accuracy ◽

The Philippines ◽

Support Vector ◽

Svm Algorithm ◽

Laguna De Bay ◽

Estimation System

Laguna de Bay, the largest freshwater lake in the Philippines, provides livelihood to fishermen and serves as a source of potable water for locals. However, its freshwater quality has degraded, and one of the main contributors is cyanobacteria that produce cyanotoxins. Existing studies use similar devices that are either too expensive or too bulky. The purpose of this study is to estimate the cyanobacteria concentration using a low-cost 16-channel spectrophotometric device and to determine the level of severity efficiently. Linear regression is used to model the dataset and estimate the number of cyanobacteria present in the water sample, while a Support Vector Machine (SVM) algorithm serves as the severity-level classifier. This study achieved high accuracy in estimating the cyanobacteria using linear regression and in classifying the level of severity with the support vector machine.

Application of Machine Learning Algorithms in Predicting Pyrolytic Analysis Result

IOP Conference Series Earth and Environmental Science ◽

10.1088/1755-1315/931/1/012013 ◽

2021 ◽

Vol 931 (1) ◽

pp. 012013

Author(s):

Le Thi Nhut Suong ◽

A V Bondarev ◽

E V Kozlova

Keyword(s):

Machine Learning ◽

Random Forest ◽

Linear Regression ◽

Decision Tree ◽

Multiple Linear Regression ◽

Oil And Gas ◽

Polynomial Regression ◽

Source Rocks ◽

Regression Algorithms ◽

Before And After

Abstract. Geochemical studies of organic matter in source rocks play an important role in predicting the oil and gas accumulation of any territory, especially in oil and gas shale. For deeper understanding, pyrolytic analyses are often carried out on samples before and after extraction of hydrocarbons with chloroform. However, extraction is a laborious and time-consuming process that doubles the workload of laboratory equipment and the time required. In this work, machine learning regression algorithms are applied to forecast S2ex based on the pyrolytic analysis results of non-extracted samples. This study is carried out using more than 300 samples from 3 different wells in the Bazhenov formation, Western Siberia. To develop a prediction model, 5 machine learning regression algorithms (Multiple Linear Regression, Polynomial Regression, Support Vector Regression, Decision Tree and Random Forest) were tested and compared. The performance of these algorithms is assessed by the R-squared coefficient (R2). The data from well X2 were used to build the model and were divided into 2 parts: 80% for training and 20% for checking. The model was also used for prediction in wells X1 and X3. These predictions were then compared with the real results obtained from standard experiments. Despite the limited amount of data, the result exceeded all expectations. The prediction results also show that the relationship between the parameters before and after extraction is complex and non-linear. The evidence is that the R2 values of Multiple Linear Regression and Polynomial Regression are negative, meaning that these models are unusable. However, Random Forest and Decision Tree perform well. The same algorithms can be applied to predict all geochemical parameters by depth or used with well-logging data.
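The negative R2 values reported in this abstract can look surprising, so here is a small sketch of how they arise, together with an 80/20 split of the kind described. It assumes the common definition R2 = 1 - SS_res / SS_tot (the abstract does not state which formula was used); the numbers are invented and the function names are hypothetical.

```python
import random
from statistics import fmean

def r_squared(predicted, observed):
    """Coefficient of determination, 1 - SS_res / SS_tot (can be negative)."""
    mean_obs = fmean(observed)
    ss_res = sum((o - p) ** 2 for p, o in zip(predicted, observed))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def split_80_20(samples, seed=0):
    """Shuffle a copy of the samples and return (training 80 %, checking 20 %)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 4) // 5
    return shuffled[:cut], shuffled[cut:]

# Hypothetical observed values and two sets of model predictions.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
good_fit = [1.1, 1.9, 3.2, 3.8, 5.0]
poor_fit = [5.0, 4.0, 3.0, 2.0, 1.0]

print(f"good fit R2 = {r_squared(good_fit, observed):.2f}")
print(f"poor fit R2 = {r_squared(poor_fit, observed):.2f}")

train, check = split_80_20(range(10))
print(len(train), len(check))
```

Under this definition, R2 drops below zero whenever a model's squared error on the evaluation data exceeds that of simply predicting the mean of the observations.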

Evaluation of the Performance of Low-Cost Air Quality Sensors at a High Mountain Station with Complex Meteorological Conditions

Atmosphere ◽

10.3390/atmos11020212 ◽

2020 ◽

Vol 11 (2) ◽

pp. 212 ◽

Cited By ~ 1

Author(s):

Hongyong Li ◽

Yujiao Zhu ◽

Yong Zhao ◽

Tianshu Chen ◽

Ying Jiang ◽

...

Keyword(s):

Air Quality ◽

Wind Speed ◽

Random Forest ◽

Linear Regression ◽

Reference Data ◽

Low Cost ◽

Weather Conditions ◽

Air Pressure ◽

High Mountain ◽

Positive Effect

Low-cost sensors have become an increasingly important supplement to air quality monitoring networks at the ground level, yet their performances have not been evaluated at high-elevation areas, where the weather conditions are complex and characterized by low air pressure, low temperatures, and high wind speed. To address this research gap, a seven-month-long inter-comparison campaign was carried out at Mt. Tai (1534 m a.s.l.) from 20 April to 30 November 2018, covering a wide range of air temperatures, relative humidities (RHs), and wind speeds. The performance of three commonly used sensors for carbon monoxide (CO), ozone (O3), and particulate matter (PM2.5) was evaluated against the reference instruments. Strong positive linear relationships between sensors and the reference data were found for CO (r = 0.83) and O3 (r = 0.79), while the PM2.5 sensor tended to overestimate PM2.5 under high RH conditions. When the data at RH >95% were removed, a strong non-linear relationship could be well fitted for PM2.5 between the sensor and reference data (r = 0.91). The impacts of temperature, RH, wind speed, and pressure on the sensor measurements were comprehensively assessed. Temperature showed a positive effect on the CO and O3 sensors, RH showed a positive effect on the PM sensor, and the influence of wind speed and air pressure on all three sensors was relatively minor. Two methods, namely a multiple linear regression model and a random forest model, were adopted to minimize the influence of meteorological factors on the sensor data. The multi-linear regression (MLR) model showed a better performance than the random forest (RF) model in correcting the sensors’ data, especially for O3 and PM2.5. Our results demonstrate the capability and potential of the low-cost sensors for the measurement of trace gases and aerosols at high mountain sites with complex weather conditions.
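The screening step described in this abstract, removing data at RH > 95 % before comparing sensor and reference values, can be sketched as below. The sketch covers only the filter and a Pearson correlation of the kind reported for CO and O3; it does not reproduce the non-linear fit used for PM2.5. All values and function names are hypothetical.

```python
from math import sqrt
from statistics import fmean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_mean, y_mean = fmean(x), fmean(y)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def drop_high_rh(sensor, reference, rh, rh_max=95.0):
    """Keep only the paired readings taken at relative humidity <= rh_max."""
    kept = [(s, r) for s, r, h in zip(sensor, reference, rh) if h <= rh_max]
    return [s for s, _ in kept], [r for _, r in kept]

# Hypothetical PM2.5 readings; the fourth sample was taken at very high humidity.
sensor = [10.0, 20.0, 30.0, 90.0, 40.0]
reference = [12.0, 22.0, 31.0, 35.0, 41.0]
rh = [50.0, 60.0, 70.0, 98.0, 80.0]

r_all = pearson_r(sensor, reference)
sensor_f, reference_f = drop_high_rh(sensor, reference, rh)
r_filtered = pearson_r(sensor_f, reference_f)
print(f"r (all data) = {r_all:.2f}, r (RH <= 95 %) = {r_filtered:.2f}")
```

In this toy data the single high-humidity sample is the one where the sensor overestimates, so dropping it raises the correlation. Real data would need a check that the filter does not discard too large a share of the record.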
