A Comparative Study of Various Methods of Handling Missing Data in UNSODA

Agriculture ◽  
2021 ◽  
Vol 11 (8) ◽  
pp. 727
Author(s):  
Yingpeng Fu ◽  
Hongjian Liao ◽  
Longlong Lv

UNSODA, a free international soil database, is widely used in many fields. However, missing soil property data have limited the utility of this dataset, especially for data-driven models. Here, three machine learning-based methods, i.e., random forest (RF) regression, support vector regression (SVR), and artificial neural network (ANN) regression, and two statistics-based methods, i.e., mean imputation and multiple imputation (MI), were used to impute the missing soil property data, including pH, saturated hydraulic conductivity (SHC), organic matter content (OMC), porosity (PO), and particle density (PD). The missing upper depths (DU) and lower depths (DL) for the sampling locations were also imputed. Before imputing the missing values in UNSODA, a missing value simulation was performed and evaluated quantitatively. Next, nonparametric tests and multiple linear regression were performed to qualitatively evaluate the reliability of these five imputation methods. Results showed that the RMSEs and MAEs of all features fluctuated within acceptable ranges. RF imputation and MI presented the lowest RMSEs and MAEs; both methods are good at explaining the variability of the data. The standard error, coefficient of variation, and standard deviation decreased significantly after imputation, and there were no significant differences before and after imputation. Together, DU, pH, SHC, OMC, PO, and PD explained 91.0%, 63.9%, 88.5%, 59.4%, and 90.2% of the variation in bulk density (BD) using RF, SVR, ANN, mean, and MI, respectively; this value was 99.8% when missing values were discarded. This study suggests that the RF and MI methods may be better for imputing the missing data in UNSODA.
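The missing value simulation described above can be sketched as follows: hide a share of known values, impute them, and score the imputed values against the hidden truth with RMSE and MAE. This is a minimal illustration only, not the authors' code. It uses mean imputation (the simplest of the five methods), and the function names, the 30% masking fraction, and the pH values are assumptions invented for the example.

```python
import math
import random


def simulate_missing(values, fraction, seed=0):
    """Hide a random fraction of known values; return (masked list, hidden indices)."""
    rng = random.Random(seed)
    n_hidden = max(1, int(round(fraction * len(values))))
    hidden = sorted(rng.sample(range(len(values)), n_hidden))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, hidden


def mean_impute(masked):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def rmse_mae(truth, imputed, hidden):
    """Score the imputed values against the values that were hidden."""
    errors = [imputed[i] - truth[i] for i in hidden]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


# Hypothetical pH values, invented for illustration only.
ph = [5.1, 6.3, 7.0, 6.8, 5.9, 7.4, 6.1, 6.6, 7.2, 5.5]
masked, hidden = simulate_missing(ph, fraction=0.3, seed=42)
imputed = mean_impute(masked)
rmse, mae = rmse_mae(ph, imputed, hidden)
print(f"hidden={hidden} RMSE={rmse:.3f} MAE={mae:.3f}")
```

Swapping `mean_impute` for another imputer while keeping the same hidden indices would give a like-for-like comparison of methods on the same simulated gaps.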

2021 ◽  
pp. 0734242X2110606
Author(s):  
Maliheh Fouladidorhani ◽  
Mohammad Shayannejad ◽  
Emmanuel Arthur

One of the approaches for recycling and reusing agricultural and animal wastes is to pyrolyse the residues and subsequently use them as soil amendments. The availability of several feedstocks makes it necessary to investigate the optimal combinations of feedstocks and pyrolysis temperature for use as soil amendments. This study evaluated five combinations of raw materials (sugarcane bagasse, rice husk, cow manure and pine wood) and their biochars produced by slow pyrolysis at 300°C and 500°C for soil amendment. Several physicochemical properties (electrical conductivity (EC), pH, cation exchange capacity (CEC), total organic matter content (C), total porosity (TP), total nitrogen (N), particle density (PD) and bulk density (BD)) were investigated. Comparison among feedstocks showed that the highest PD, BD and CEC were observed in WM (cow manure-pine wood). The pyrolysis process increased the PD, TP, N and monovalent cations and decreased EC, CEC and BD. Compared to the feedstock, pyrolysis increased the N content, but higher temperatures lowered the N content. Pyrolysis at 500°C reduced the EC, N, CEC and biochar yield by 18%, 13%, 21% and 24%, respectively, compared to 300°C. Pyrolysis at 500°C increased the pH, Na+ and K+ by 17%, 12% and 22%, respectively, compared to 300°C. Considering the physicochemical properties of biochar and the costs, the bagasse-wood-rice (BWR) combination and a temperature of 300°C are suggested for biochar production for soil amendment.


2021 ◽  
Vol 30 (2) ◽  
pp. 141-149
Author(s):  
Tasnim Zannat ◽  
Farhana Firoz Meem ◽  
Rubaiat Sharmin Promi ◽  
Umme Qulsum Poppy ◽  
MK Rahman

Twelve soil and twelve leaf samples were collected from twelve litchi (Litchi chinensis Sonn.) orchards at different locations in Dinajpur to evaluate selected physico-chemical properties and the nutrient status of the soil, and the concentration of nutrients in litchi leaves. The pH of the soil varied from very strongly acidic to medium acidic (4.8 - 5.7), organic matter content varied from 0.84 - 1.88%, and EC varied from 302.4 - 310.2 μS/cm. The dominant soil textural class was clay loam. The average particle density was 2.49 g/cm³. Total N, P, K and S in soils were 0.053 - 0.180%, 0.02 - 0.07%, 0.046 - 0.370 meq/100 g, and 0.015 - 0.028%, respectively. Available N, P, K, S, Zn, Fe, Mn and B in soils were 30.40 - 57.8 mg/kg, 10.53 - 14.33 mg/kg, 0.03 - 0.32 meq/100 g, 20.03 - 34.80 mg/kg, 0.68 - 1.50 μg/g, 31.8 - 41.5 μg/g, 6.75 - 7.39 μg/g and 0.25 - 0.51 μg/g, respectively. The concentrations of total N, P, K, S, Zn and Mn in the leaves were 1.74 - 2.20%, 0.11 - 0.188%, 0.104 - 0.198%, 0.129 - 0.430%, 12 - 14 μg/g and 30 - 74 μg/g, respectively. The overall results indicated that the soils under litchi plantation in the Dinajpur area are of medium fertility. Farmers could therefore be advised to grow litchi plants after applying amendments to the soils to improve their physico-chemical properties in the Dinajpur area of Bangladesh. Dhaka Univ. J. Biol. Sci. 30(2): 141-149, 2021 (July)


Author(s):  
Juheng Zhang ◽  
Xiaoping Liu ◽  
Xiao-Bai Li

We study strategically missing data problems in predictive analytics with regression. In many real-world situations, such as financial reporting, college admission, job application, and marketing advertisement, data providers often conceal certain information on purpose in order to gain a favorable outcome. It is important for the decision-maker to have a mechanism to deal with such strategic behaviors. We propose a novel approach to handle strategically missing data in regression prediction. The proposed method derives imputation values of strategically missing data based on the Support Vector Regression models. It provides incentives for the data providers to disclose their true information. We show that with the proposed method imputation errors for the missing values are minimized under some reasonable conditions. An experimental study on real-world data demonstrates the effectiveness of the proposed approach.


2019 ◽  
Vol 78 (1) ◽  
pp. 35-45 ◽  
Author(s):  
Beata Bosiacka ◽  
Helena Więcław ◽  
Paweł Marciniuk ◽  
Marek Podlasiński

The vegetation of protected salt meadows along the Baltic coast is fairly well known; however, dandelions have so far been treated as a collective species. The aim of our study was to examine the microspecies diversity of the genus Taraxacum in Polish salt and brackish coastal meadows and to analyse the soil property preferences of the dandelion microspecies identified. In addition, we analysed the relations between soil properties and vegetation patterns in dandelion-supporting coastal meadows (by canonical correspondence analysis). The salt and brackish meadows we visited along the Polish Baltic coast were found to support a total of 27 dandelion microspecies representing 5 sections. Analysis of vegetation patterns showed that all the soil parameters together (C:N ratio, organic matter content, pH, concentrations of Mg, P and K, and electrolytic conductivity of the saturated soil extract, ECe) explained 32.07% of the total variance in the species data. The maximum abundance of most dandelion microspecies was associated with the highest soil fertility, moderate pH values and organic matter content, and with the lowest magnesium content and soil salinity. The exceptions were T. latissimum, T. stenoglossum, T. pulchrifolium and T. lucidum, the occurrence of which was related to the lowest soil fertility and the highest salinity. In addition, several microspecies (T. leptodon, T. gentile, T. haematicum, T. fusciflorum and T. balticum) were observed at moderate C:N ratios and ECe. Four other microspecies (T. infestum, T. cordatum, T. hamatum, T. sertatum) occurred at the lowest pH and organic matter content. The information obtained adds to the still insufficient body of knowledge on the ecological spectra of individual dandelion microspecies, and hence on their potential indicator properties.


2020 ◽  
Vol 07 (02) ◽  
pp. 161-177
Author(s):  
Oyekale Abel Alade ◽  
Ali Selamat ◽  
Roselina Sallehuddin

One major characteristic of data is completeness. Missing data is a significant problem in medical datasets: it leads to incorrect classification of patients and endangers their health management. Many factors lead to missing values in medical datasets. In this paper, we argue that the causes of missing data in a medical dataset should be examined to ensure that the right imputation method is used. The mechanism of missingness was studied to identify the missing pattern of the dataset and to determine a suitable imputation technique for generating complete datasets. The pattern shows that the missingness in the dataset used in this study is not monotone. Also, single imputation techniques underestimate variance and ignore relationships among the variables; therefore, we used a multiple imputation technique that runs five iterations for the imputation of each missing value. All missing values in the dataset were regenerated. The imputed datasets were validated using an extreme learning machine (ELM) classifier. The results show an improvement in accuracy for the imputed datasets. The work can, however, be extended to compare the accuracy of the imputed datasets with that of the original dataset using different classifiers, such as support vector machine (SVM), radial basis function (RBF), and ELMs.
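The statement that single imputation underestimates variance can be checked with a small numeric sketch. The example below is an illustration under assumed data, not the authors' procedure: the six measurements are invented, and mean imputation stands in for single imputation in general.

```python
import statistics

# Hypothetical measurements with two missing entries (None), for illustration only.
observed = [4.0, 6.0, None, 8.0, None, 10.0]

known = [v for v in observed if v is not None]
fill = statistics.mean(known)
mean_imputed = [fill if v is None else v for v in observed]

var_known = statistics.variance(known)           # sample variance of observed values
var_imputed = statistics.variance(mean_imputed)  # sample variance after mean imputation

print(f"variance of observed values: {var_known:.3f}")
print(f"variance after mean imputation: {var_imputed:.3f}")
```

Every imputed value sits exactly at the mean, so the sum of squared deviations stays the same while the sample size grows, and the estimated variance shrinks. Multiple imputation addresses this by drawing several plausible values per gap rather than one fixed value.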


2020 ◽  
Vol 4 (Supplement_1) ◽  
pp. 509-509
Author(s):  
Peiyi Lu ◽  
Mack Shelley

Studies using data from longitudinal health surveys of older adults usually assume that the data are missing completely at random (MCAR) or missing at random (MAR), and subsequent analyses therefore use multiple imputation or likelihood-based methods to handle missing data. However, little existing research actually examines whether the data meet the MCAR/MAR assumptions before performing data analyses. This study first summarized the statistical methods commonly used to test the missing mechanism and discussed their conditions of application. Then, using two-wave longitudinal data from the Health and Retirement Study (HRS; wave 2014-2015 and wave 2016-2017; N=18,747), this study applied different approaches to test the missing mechanism of several demographic and health variables. These approaches included Little’s test, the logistic regression method, nonparametric tests, false discovery rate, and others. Results indicated that the data did not meet the MCAR assumption even though they had a very low rate of missing values. Demographic variables provided good auxiliary information for health variables. Health measures (e.g., self-reported health, activities of daily living, depressive symptoms) met the MAR assumption. Older respondents could drop out or die during the longitudinal survey, but attrition did not significantly affect the MAR assumption. Our findings supported the MAR assumption for the demographic and health variables in HRS, and therefore provided statistical justification for HRS researchers to use imputation or likelihood-based methods to deal with missing data. However, researchers are strongly encouraged to test the missing mechanism of the specific variables/data they choose when using a new dataset.
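As a rough illustration of what testing the missing mechanism involves, the sketch below compares an always-observed auxiliary variable (age) between respondents with and without a missing health value. It is a simple group comparison using Welch's t statistic, not Little's test or the logistic regression method named above, and the records are invented for the example. Under MCAR the two groups should not differ systematically.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


# Hypothetical records: (age, self-rated health or None if missing).
records = [
    (62, 3), (65, 4), (58, 2), (71, None), (80, None),
    (60, 3), (77, None), (68, 4), (83, None), (59, 2),
]

age_missing = [age for age, health in records if health is None]
age_observed = [age for age, health in records if health is not None]

t = welch_t(age_missing, age_observed)
print(f"mean age (missing)={statistics.mean(age_missing):.1f}, "
      f"mean age (observed)={statistics.mean(age_observed):.1f}, t={t:.2f}")
```

A large absolute t value, as in this contrived sample where older respondents have more gaps, suggests that missingness depends on an observed variable, which argues against MCAR but is compatible with MAR.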


2020 ◽  
Vol 3 (2) ◽  
pp. 353-365
Author(s):  
Babita Neupane ◽  
Krishna Aryal ◽  
Lal Bahadur Chhetri ◽  
Shishir Regmi

This experiment was conducted in a farmer’s field at Khajrauta, Gadhawa-4, Dang, Nepal to evaluate the effect of integrated nutrient management (INM) on the growth and yield of cauliflower as well as its residual effects on soil properties. The cauliflower variety silvercup-60 was grown under eight different treatments; T1: 50% N through RDF + 50% N through FYM; T2: 50% N through RDF + 50% N through PM; T3: 50% N through RDF + 50% N through VC; T4: 50% N through RDF + 25% N through FYM + 25% N through PM; T5: 50% N through RDF + 25% N through VC + 25% N through PM; T6: 50% N through RDF + 25% N through VC + 25% N through FYM; T7: 50% N through RDF + 25% N through VC + 25% N through FYM; T8: 50% N through RDF + 50% N through FYM, VC and poultry manure. The experiment was laid out in an RCB design with three replications. The results revealed that the highest plant height (36.40 cm), number of leaves (15), plant spread (31.72 cm), leaf area (526.5 cm²), curd weight (207.3 g) and curd yield (12.85 t/ha) were found under 50% N through RDF + 50% N through VC. The root length, root diameter and root density were better under all INM treatments than under 100% N through RDF. INM treatments showed lower bulk density, lower particle density, a greater infiltration rate and greater organic matter content than application of 100% N through RDF. Soil total nitrogen increased in all INM treatments, while soil available phosphorus decreased in all treatments except 100% N through RDF and 50% N through RDF + 50% N through PM. Thus, farmers are advised to apply 50% N through VC along with 50% N through RDF to increase cauliflower yield.


2019 ◽  
Vol 11 (1) ◽  
Author(s):  
Leandro Campos Pinto ◽  
Wantuir Filipe Teixeira Chagas ◽  
Francisco Hélcio Canuto Amaral

The relationship between management and soil quality may be evaluated through the behavior of soil physical, chemical and biological properties. Assessing soil structure calls for attributes that measure the porosity, the distribution of pores by size and its implications for the permeability and rigidity of the pores, as well as the stability of the units that compose the soil structure. The aim of this research was to assess the structure of a Dystroferric Red Latosol (Oxisol) under conventional corn crop, conventional coffee crop, eucalyptus crop and an equilibrium reference (native vegetation), by determining the particle density, bulk density, calculated total porosity, microporosity, macroporosity, moisture saturation, determined total porosity, blocked pores and aggregate stability. Soil under native vegetation presented the lowest values of particle density, probably due to the greater soil organic matter content in this environment. A tendency of increasing blocked pores with decreasing bulk density was observed. As expected, bulk density, which varied from 0.87 to 1.03 g cm⁻³, showed an inverse relationship with total porosity. The largest values of geometric mean diameter, presented by the soil under native vegetation, are due to the greater degree of structuring of this soil, which contributes to the stabilization of the aggregates in this environment. The native vegetation environment presented better soil physical quality than the other land uses.
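The inverse relationship between bulk density and calculated total porosity follows from the standard relation TP = 1 - BD/PD. The sketch below applies it to bulk densities spanning the reported range. The function name and the particle density of 2.65 g cm⁻³ are assumptions chosen for illustration, not values taken from the study.

```python
def total_porosity(bulk_density, particle_density):
    """Calculated total porosity (fraction) from densities in the same units."""
    if particle_density <= 0 or bulk_density < 0:
        raise ValueError("bulk density must be >= 0 and particle density > 0")
    return 1.0 - bulk_density / particle_density


# Bulk densities spanning the range reported above (g cm^-3); the particle
# density of 2.65 g cm^-3 is an assumed textbook value, not from the study.
for bd in (0.87, 0.95, 1.03):
    print(bd, round(total_porosity(bd, 2.65), 3))
```

For a fixed particle density, each increase in bulk density lowers the calculated porosity linearly, which is why the two properties move in opposite directions.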


2022 ◽  
Vol 43 (1) ◽  
pp. 7-24
Author(s):  
Iris Mariane da Silva Martins ◽  
Tatiane Carla Silva ◽  
Maria Julia Betiolo Troleis ◽  
Paulino Taveira de Souza ◽  
...  

Analyzing soil attributes with geostatistical tools improves the interpretation of site-specific soil management. Thus, this study aimed to evaluate the physical, chemical, and microbiological properties of a Typical Haplustox (Oxisol), identifying those with the best linear and spatial correlation with eucalyptus (Eucalyptus spp.) vegetative growth. The experiment was conducted at the Teaching, Research, and Extension Farm (FEPE) of the Universidade Estadual Paulista (UNESP), Campus of Ilha Solteira. Thirty-five points spaced 13 meters apart were demarcated for analysis, distributed in 5 rows of 7 points each. From each point, 2 soil samples were collected from the 0-10 cm depth layer. The physical, chemical, and microbiological soil properties evaluated were: sand, silt, and clay contents; penetration resistance (PR), gravimetric moisture (GM), real density (RD), microbial biomass carbon (MBC), respirometry (CO2-C), metabolic quotient (qCO2), organic matter content (OM), and hydrogenionic potential (pH). The eucalyptus attributes assessed were plant height (PH) and circumference at breast height (CBH). Each attribute was analyzed by descriptive statistics using the SAS software. The data frequency distribution was verified by the Shapiro-Wilk method, and geospatial changes were analyzed with the GS+ software. The soil property that best explained the variability in eucalyptus dendrometric attributes was real density (RD). With the exception of RD, which significantly represents eucalyptus vegetative performance, none of the properties showed spatial dependence (i.e., they showed a pure nugget effect).


Author(s):  
Wisam A. Mahmood ◽  
Mohammed S. Rashid ◽  
Teaba Wala Aldeen ◽  

Missing values are common in medical research and can introduce substantial bias if they are neglected or handled poorly. Some standard statistical methods have been developed to deal with this problem, yet none so far yields fully credible estimates. Missing values also reduce the usable data size and lower efficiency. A number of imputation methods for handling missing values have been proposed in earlier scholarly work. The regular methods include the complete case method, Mean Imputation (Mean), Last Observation Carried Forward (LOCF), the Expectation-Maximization (EM) algorithm, Markov Chain Monte Carlo (MCMC), Hot Deck (HOT), Regression Imputation (Regress), K-nearest neighbor (KNN), K-Mean Clustering, Fuzzy K-Mean Clustering, Support Vector Machine, and Multiple Imputation (MI). In the present paper, a simulation study investigates the efficacy of these imputation methods in a longitudinal data setting under missing completely at random (MCAR). We considered three levels of missingness: a low level of 5% and higher levels of 30% and 50%. Comparing the methods through this simulation study, we concluded that LOCF has more bias than the other methods in most situations.
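The kind of simulation described above can be sketched in a few lines: delete values completely at random at each of the three rates, fill the gaps with LOCF, and measure the bias against the true series. This is a minimal illustration under assumed data, not the authors' simulation design. The function names, the seed, and the single rising series are hypothetical, and the first visit is always kept so that LOCF has a value to carry forward.

```python
import random


def make_mcar(series, rate, seed=0):
    """Set each value after the first to None independently with probability `rate`."""
    rng = random.Random(seed)
    return [v if i == 0 or rng.random() >= rate else None
            for i, v in enumerate(series)]


def locf(series):
    """Last Observation Carried Forward: fill each gap with the previous value."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical repeated measurements for one subject with a steady upward trend.
visits = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]

for rate in (0.05, 0.30, 0.50):
    masked = make_mcar(visits, rate, seed=1)
    filled = locf(masked)
    bias = sum(f - t for f, t in zip(filled, visits)) / len(visits)
    print(f"rate={rate:.2f} missing={masked.count(None)} mean bias={bias:.3f}")
```

Because the series rises over time, every carried-forward value is older and lower than the truth, so the mean bias can never be positive here. This is one plausible reason LOCF can perform poorly on longitudinal data with a trend.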

