On the benefits of clustering approaches in digital soil mapping: an application example concerning soil texture regionalization

Mapping Intimacies, 10.5194/soil-2020-102, 2021

Author(s): Istvan Dunkl, Mareike Ließ

Keyword(s): Soil Texture, Expert Knowledge, Learning Algorithm, Model Performance, Imbalanced Data, Soil Mapping, Digital Soil Mapping, Training Data, Data Set, Environmental Covariates

Abstract. High-resolution soil maps are urgently needed by land managers and researchers for a variety of applications. Digital Soil Mapping (DSM) makes it possible to regionalize soil properties by relating them to environmental covariates with the help of an empirical model. In this study, a legacy soil data set was used to train a machine learning algorithm to predict the particle size distribution within the catchment of the Bode river in Saxony-Anhalt (Germany). The ensemble learning method random forest was used to predict soil texture based on environmental covariates originating from a digital elevation model, land cover data and geologic maps. We studied the usefulness of clustering applications in addressing various aspects of the DSM procedure. To investigate the role of the imbalanced data problem in the learning process, the environmental variables were used to cluster the landscape of the study area. Different sampling strategies were used to create balanced training data and were evaluated on their ability to improve model performance. Clustering applications were also involved in feature selection and stratified cross-validation. Overall, clustering applications appear to be a versatile tool that can be employed at various steps of the DSM procedure. Beyond these successful applications, further fields of application in DSM were identified. One of them is finding adequate means to include expert knowledge.
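The idea of using landscape clusters to build balanced training data can be illustrated with a minimal sketch. It assumes that cluster labels have already been computed from the environmental covariates, and it shows only one possible strategy (undersampling the large clusters); the abstract does not say which sampling strategies the authors used. The function name `balanced_sample` and the example labels are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(cluster_labels, per_cluster, seed=0):
    """Return sorted sample indices with at most `per_cluster` picks per cluster."""
    rng = random.Random(seed)
    # Group sample indices by the cluster they belong to.
    by_cluster = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        by_cluster[label].append(index)
    chosen = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        # Small clusters are kept whole; large clusters are undersampled.
        k = min(per_cluster, len(members))
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


# Hypothetical cluster labels for ten soil profiles; cluster 0 dominates.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
training_indices = balanced_sample(labels, per_cluster=2)
```

With these labels, each of the three clusters contributes two profiles to the training set, so the dominant cluster no longer outweighs the others.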


Evaluation of statistical and geostatistical models of digital soil properties mapping in tropical mountain regions

Revista Brasileira de Ciência do Solo, 10.1590/s0100-06832014000300003, 2014, Vol 38 (3), pp. 706-717, Cited By ~ 5

Author(s): Waldir de Carvalho Junior, Cesar da Silva Chagas, Philippe Lagacherie, Braz Calderano Filho, Silvio Barge Bhering

Keyword(s): Soil Properties, Predictive Models, Vegetation Index, Normalized Difference Vegetation Index, Soil Mapping, Digital Soil Mapping, Coefficient Of Determination, Band Ratio, Data Set, Environmental Covariates

Soil properties have an enormous impact on economic and environmental aspects of agricultural production. Quantitative relationships between soil properties and the factors that influence their variability are the basis of digital soil mapping. The predictive models of soil properties evaluated in this work are statistical (multiple linear regression, MLR) and geostatistical (ordinary kriging and co-kriging). The study was conducted in the municipality of Bom Jardim, RJ, using a soil database with 208 sampling points. Predictive models were evaluated for sand, silt and clay fractions, pH in water and organic carbon at six depths according to the specifications of the global digital soil mapping consortium (GlobalSoilMap). Continuous covariates and categorical predictors were used and their contributions to the model assessed. Only the environmental covariates elevation, aspect, stream power index (SPI), soil wetness index (SWI), normalized difference vegetation index (NDVI), and b3/b2 band ratio were significantly correlated with soil properties. The predictive models had a mean coefficient of determination of 0.21. The best results were obtained with the geostatistical predictive models, where the highest coefficient of determination, 0.43, was obtained for the sand fraction at a depth of 60 to 100 cm. A sparse data set of soil properties can explain only part of the spatial variation of these properties in digital mapping. The results may be related to the sampling density and to the quantity and quality of the environmental covariates and predictive models used.
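For readers unfamiliar with the metric, the coefficient of determination can be computed as sketched below. This uses the common definition 1 - SS_res / SS_tot; the abstract does not state which variant the authors used, and the sand values are made up for illustration.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, defined here as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: squared errors of the model.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: squared deviations from the observed mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical sand contents (%) at four validation points.
observed_sand = [10.0, 20.0, 30.0, 40.0]
predicted_sand = [12.0, 18.0, 33.0, 37.0]
score = r_squared(observed_sand, predicted_sand)
```

A value of 0.21, the mean reported above, would mean that the model accounts for roughly a fifth of the observed variance under this definition.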


Information-Theoretic Generalization Bounds for Meta-Learning and Applications

Entropy, 10.3390/e23010126, 2021, Vol 23 (1), pp. 126

Author(s): Sharu Theresa Jose, Osvaldo Simeone

Keyword(s): Learning Algorithm, Broad Class, Performance Measure, Training Data, Learning To Learn, Data Set, Information Theoretic, Meta Learning, Task Training, Test Sets

Meta-learning, or “learning to learn”, refers to techniques that infer an inductive bias from data corresponding to multiple related tasks with the goal of improving the sample efficiency for new, previously unobserved, tasks. A key performance measure for meta-learning is the meta-generalization gap, that is, the difference between the average loss measured on the meta-training data and on a new, randomly selected task. This paper presents novel information-theoretic upper bounds on the meta-generalization gap. Two broad classes of meta-learning algorithms are considered that use either separate within-task training and test sets, like model agnostic meta-learning (MAML), or joint within-task training and test sets, like reptile. Extending the existing work for conventional learning, an upper bound on the meta-generalization gap is derived for the former class that depends on the mutual information (MI) between the output of the meta-learning algorithm and its input meta-training data. For the latter, the derived bound includes an additional MI between the output of the per-task learning procedure and corresponding data set to capture within-task uncertainty. Tighter bounds are then developed for the two classes via novel individual task MI (ITMI) bounds. Applications of the derived bounds are finally discussed, including a broad class of noisy iterative algorithms for meta-learning.
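As background, the kind of conventional-learning bound that this line of work extends can be written as follows. This is the standard mutual-information bound for a single learning task, not the meta-learning bound derived in the paper, and the notation is an assumption: W is the output of the learning algorithm, S is a training set of n i.i.d. samples, L_\mu is the population loss, L_S is the empirical loss, and the loss is assumed to be \sigma-sub-Gaussian.

```latex
\left| \mathbb{E}\left[ L_\mu(W) - L_S(W) \right] \right|
  \le \sqrt{\frac{2\sigma^2}{n}\, I(W; S)}
```

The less the algorithm's output depends on the particular training set, as measured by the mutual information I(W; S), the smaller the expected generalization gap.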


Pancreatic cancer detection using EpiDetect signatures in plasma-derived cell free DNA in high-risk patients with new onset diabetes.

Journal of Clinical Oncology, 10.1200/jco.2021.39.15_suppl.e16265, 2021, Vol 39 (15_suppl), pp. e16265-e16265

Author(s): Gulfem Guler, Anna Bergamaschi, David Haan, Michael Kesling, Yuhong Ning, ...

Keyword(s): Pancreatic Cancer, Early Stage, Model Performance, Training Data, Diabetes Diagnosis, Whole Genome, Data Set, Independent Validation, Onset Diabetes, New Onset

e16265 Background: Pancreatic cancer (PaCa) is the third leading cause of cancer death in the United States despite its low incidence rate, owing to a 5-year survival rate of 10%. It is often asymptomatic in its early stages, so the majority of diagnoses occur when the cancer has already metastasized to distant organs. Late diagnosis deprives patients of potentially curative treatments such as surgery and reduces survival rates. Diabetes can be an early symptom of PaCa. Indeed, 25% of PaCa patients had a preceding diabetes diagnosis. Among all people with new onset diabetes (NOD), 0.85% will be diagnosed with PaCa within 3 years, which represents a 6- to 8-fold increased risk of PaCa compared to the general population. Surveillance of the NOD population for PaCa presents an opportunity to shift PaCa diagnosis to an earlier stage. Methods: Whole blood was obtained from a cohort of 117 PaCa patients as well as 800 non-cancer controls with and without NOD. Plasma was processed to isolate cfDNA, and 5hmC and low-pass whole genome libraries were generated and sequenced. The EpiDetect assay combines 5hmC and whole genome sequencing data, which were generated using Bluestar Genomics’s technology platform. Results: To investigate whether PaCa can be detected in plasma, we interrogated plasma-derived cfDNA epigenomic and genomic signal from PaCa patients and non-cancer controls. We first trained stacked ensemble models on PaCa and non-cancer samples utilizing 5hmC, fragmentation and CNV-based biomarkers from cfDNA. These models performed stably, with a median of 72.8% sensitivity and 90.1% specificity measured across 25 outer fold iterations using the training data set, 50% of which was early stage (Stages I & II) disease. The final binomial ensemble model was trained using all of the training data, yielding an area under the receiver operating characteristic curve (auROC) of 0.9, with 75% sensitivity and 89% specificity.
This model was then tested on an independent validation data set from 33 PaCa patients (24 with diabetes, 15 of whom had NOD) and 202 non-cancer control patients (76 with diabetes, 51 of whom had NOD) and yielded a classification performance auROC of 0.9 with 67% sensitivity at 92% specificity. Lastly, model performance in the subset of the patient cohort with NOD only had an auROC of 0.87 with 60% sensitivity at 88% specificity. Conclusions: Our results indicate that 5hmC profiles along with CNV and fragmentation patterns from cfDNA can be used to detect PaCa in plasma-derived cfDNA. Overall, model performance was stable and consistent between the training and independent validation data sets. A larger clinical study is under development to investigate the utility of the model described in this pilot study in identifying occult PaCa within the NOD population, with the aim of shifting diagnosis to an early stage and potentially improving patient outcomes.
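Sensitivity and specificity, the two performance measures quoted throughout this abstract, follow directly from the counts of a confusion matrix. The sketch below uses made-up counts purely for illustration; they are not the study's data, and the function names are hypothetical.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of actual cancer cases that the classifier flags as cancer."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of non-cancer controls that the classifier clears."""
    return true_negatives / (true_negatives + false_positives)


# Toy counts (not from the study): 10 cases and 20 controls.
toy_sensitivity = sensitivity(true_positives=8, false_negatives=2)
toy_specificity = specificity(true_negatives=18, false_positives=2)
```

In a screening setting such as the NOD population, where true cases are rare, high specificity matters because even a small false positive rate applies to a very large number of people without cancer.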


Water Quality Prediction Using Statistical Tool and Machine Learning Algorithm

Waste Management, 10.4018/978-1-7998-1210-4.ch029, 2020, pp. 609-623

Author(s): Arun Kumar Beerala, Gobinath R., Shyamala G., Siribommala Manvitha

Keyword(s): Machine Learning, Learning Algorithm, Training Data, Machine Learning Techniques, Statistical Tool, Data Set, Water Quality Prediction, Living Things, Sampling Locations, Different Seasons

Water is the most valuable natural resource for all living things and the ecosystem. Groundwater quality changes as a result of ecosystem change, industrialisation, urbanisation, and other factors. In the study, 60 samples were taken and analysed for various physico-chemical parameters. The sampling locations were recorded using the global positioning system (GPS), and samples were collected in two consecutive years during two different seasons, monsoon (Nov-Dec) and post-monsoon (Jan-Mar). In 2016-2017 and 2017-2018, pH, EC, and TDS were measured in the field. Hardness and chloride were determined by titration. Nitrate and sulphate were determined with a spectrophotometer. Machine learning techniques were used to train on the data set and to predict the unknown values. The dominant ions in the groundwater are Ca²⁺ and Mg²⁺ among the cations and Cl⁻, SO₄²⁻, and NO₃⁻ among the anions. The regression value for the training data set was found to be 0.90596, and for the entire network, it was found to be 0.81729. The best performance was observed as 0.0022605 at epoch 223.


New environmental covariates for digital soil mapping

Developments in Soil Science - Digital Soil Mapping - An Introductory Perspective, 10.1016/s0166-2481(06)31054-9, 2006, pp. 205-206

Keyword(s): Soil Mapping, Digital Soil Mapping, Environmental Covariates


Lost in Space: Geolocation in Event Data

Political Science Research and Methods, 10.1017/psrm.2018.23, 2018, Vol 7 (04), pp. 871-888, Cited By ~ 6

Author(s): Sophie J. Lee, Howard Liu, Michael D. Ward

Keyword(s): Learning Algorithm, Text Processing, Contextual Information, Training Data, Supervised Machine Learning, Model Parameters, Event Data, Data Set, N Gram, Automated Text Processing

Improving geolocation accuracy in text data has long been a goal of automated text processing. We depart from the conventional method and introduce a two-stage supervised machine-learning algorithm that evaluates each location mention to be either correct or incorrect. We extract contextual information from texts, i.e., N-gram patterns for location words, mention frequency, and the context of sentences containing location words. We then estimate model parameters using a training data set and use this model to predict whether a location word in the test data set accurately represents the location of an event. We demonstrate these steps by constructing customized geolocation event data at the subnational level using news articles collected from around the world. The results show that the proposed algorithm outperforms existing geocoders even in a case added post hoc to test the generality of the developed algorithm.
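The contextual features mentioned above can be pictured with a small sketch that collects the N-grams surrounding a location word. This is only an illustration of the general idea, not the authors' feature pipeline; the function name `context_ngrams` and the example sentence are hypothetical.

```python
def context_ngrams(tokens, position, n=2):
    """Return every n-gram of `tokens` that contains the token at `position`."""
    grams = []
    # An n-gram containing `position` starts between position - n + 1 and position.
    for start in range(position - n + 1, position + 1):
        if start >= 0 and start + n <= len(tokens):
            grams.append(tuple(tokens[start:start + n]))
    return grams


# Hypothetical news sentence with one location mention.
sentence = "protesters marched in Cairo on Friday".split()
features = context_ngrams(sentence, sentence.index("Cairo"), n=2)
```

Patterns such as ("in", "Cairo") could then serve as inputs to a classifier that judges whether a mention marks the actual event location.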


Digital soil mapping based on wavelet decomposed components of environmental covariates

Geoderma, 10.1016/j.geoderma.2017.05.017, 2017, Vol 303, pp. 118-132, Cited By ~ 12

Author(s): Xiao-Lin Sun, Hui-Li Wang, Yu-Guo Zhao, Chaosheng Zhang, Gan-Lin Zhang

Keyword(s): Soil Mapping, Digital Soil Mapping, Environmental Covariates


An appropriate data set size for digital soil mapping in Erechim, Rio Grande do Sul, Brazil

Revista Brasileira de Ciência do Solo, 10.1590/s0100-06832013000200007, 2013, Vol 37 (2), pp. 359-366, Cited By ~ 8

Author(s): Alexandre ten Caten, Ricardo Simão Diniz Dalmolin, Fabrício de Araújo Pedron, Luis Fernando Chimelo Ruiz, Carlos Antônio da Silva

Keyword(s): Rio Grande, Northern Region, Soil Mapping, Digital Soil Mapping, Digital Information, Rio Grande Do Sul, Predictive Capacity, Data Set, Data Volume, The Impact

Digital information generates the possibility of a high degree of redundancy in the data available for fitting predictive models used for Digital Soil Mapping (DSM). Among these models, the Decision Tree (DT) technique has been increasingly applied due to its capacity to deal with large data sets. The purpose of this study was to evaluate the impact of the data volume used to generate the DT models on the quality of soil maps. An area of 889.33 km² was chosen in the Northern region of the State of Rio Grande do Sul. The soil-landscape relationship was obtained from reambulation of the studied area and the alignment of the units in the 1:50,000 scale topographic mapping. Six predictive covariates linked to the factors soil formation, relief and organisms, together with data sets of 1, 3, 5, 10, 15, 20 and 25 % of the total data volume, were used to generate the predictive DT models in the data mining program Waikato Environment for Knowledge Analysis (WEKA). In this study, sample densities below 5 % resulted in models with less power to capture the complexity of the spatial distribution of the soil in the study area. The trade-off between the data volume to be handled and the predictive capacity of the models was best for samples between 5 and 15 %. For the models based on these sample densities, the collected field data indicated a predictive mapping accuracy close to 70 %.


Application of Support Vector Machine in Determination of Real Estate Price

Advanced Materials Research, 10.4028/www.scientific.net/amr.461.818, 2012, Vol 461, pp. 818-821

Author(s): Shi Hu Zhang

Keyword(s): Support Vector Machine, Real Estate, Learning Algorithm, Predictive Ability, Training Data, Small Samples, Support Vector, Data Set, Real Estate Price

Real estate prices are currently a focus of public concern. The support vector machine (SVM) is a new machine learning algorithm that, owing to its excellent learning performance and its particular advantages on small-sample problems, is now used in many areas. Determining real estate prices is a complicated problem because of its non-linearity and the small quantity of training data. In this study, an SVM is proposed to forecast real estate prices in China. The experimental results indicate that the SVM method can achieve greater accuracy than the grey model and an artificial neural network when training data are scarce. It was also found that the predictive ability of the SVM exceeded that of some traditional pattern recognition methods for the data set used here.


Environmental Covariates for Digital Soil Mapping in the Western USA

Digital Soil Mapping, 10.1007/978-90-481-8863-5_2, 2010, pp. 17-27, Cited By ~ 2

Author(s): J.L. Boettinger

Keyword(s): Soil Mapping, Digital Soil Mapping, Environmental Covariates, Western Usa
