Joint machine learning and human learning design with sequential active learning and outlier detection for linear regression problems

Author(s): Xiaohua Li, Jian Zheng
2021, Vol 13 (2), pp. 287
Author(s): Katja Berger, Juan Pablo Rivera Caicedo, Luca Martino, Matthias Wocher, Tobias Hank, et al.

The current exponential increase of spatiotemporally explicit data streams from satellite-based Earth observation missions offers promising opportunities for global vegetation monitoring. Intelligent sampling through active learning (AL) heuristics provides a pathway for fast inference of essential vegetation variables by means of hybrid retrieval approaches, i.e., machine learning regression algorithms trained by radiative transfer model (RTM) simulations. In this study, we summarize AL theory and perform a brief systematic literature survey of AL heuristics used in the context of Earth observation regression problems over terrestrial targets. Across all relevant studies, two findings emerged: (i) models trained on AL-optimized training data sets outperformed models trained on large randomly sampled data sets in retrieval accuracy, and (ii) the Euclidean distance-based (EBD) diversity method tends to be the most efficient AL technique in terms of accuracy and computational demand. Additionally, a case study is presented based on experimental data employing both uncertainty and diversity AL criteria. Here, a training database simulated with the PROSAIL-PRO canopy RTM is used to demonstrate the benefit of AL techniques for the estimation of total leaf carotenoid content (Cxc) and leaf water content (Cw). Gaussian process regression (GPR) was used to minimize and optimize the training data set with AL. Training the GPR algorithm on optimally AL-sampled data sets led to improved variable retrievals compared to training on full data pools, which is further demonstrated on a mapping example. From these findings, we recommend the use of AL-based sub-sampling procedures to select the most informative samples out of large training data pools. This will not only improve regression accuracy by excluding redundant information, but also speed up processing and reduce the final model size of kernel-based machine learning regression algorithms such as GPR. With this study we want to encourage further testing and implementation of AL sampling methods for hybrid retrieval workflows. AL can contribute to the solution of regression problems within the framework of operational vegetation monitoring using satellite imaging spectroscopy data, and may strongly facilitate data processing for cloud-computing platforms.
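To make the diversity criterion concrete, below is a minimal, hypothetical sketch of EBD-style diversity sampling feeding a GPR model, using scikit-learn and synthetic stand-in data; the paper's actual PROSAIL-PRO simulations and retrieval setup are not reproduced here.

```python
# Sketch of Euclidean distance-based (EBD) diversity active learning for GPR.
# All data and parameters below are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def ebd_select(X_pool, X_train, n_new):
    """Pick the n_new pool samples farthest (in Euclidean distance)
    from their nearest neighbour in the current training set."""
    chosen = []
    train = list(X_train)
    for _ in range(n_new):
        d = np.array([min(np.linalg.norm(x - t) for t in train) for x in X_pool])
        idx = int(np.argmax(d))
        chosen.append(idx)
        train.append(X_pool[idx])
    return chosen

rng = np.random.default_rng(0)
X_pool = rng.uniform(size=(500, 4))                      # stand-in simulated features
y_pool = X_pool.sum(axis=1) + rng.normal(0, 0.05, 500)   # stand-in target (e.g. Cw)

seed = rng.choice(500, 10, replace=False)                # small random seed set
X_train, y_train = X_pool[seed], y_pool[seed]

for _ in range(5):                                       # a few AL iterations
    new = ebd_select(X_pool, X_train, n_new=10)
    X_train = np.vstack([X_train, X_pool[new]])
    y_train = np.concatenate([y_train, y_pool[new]])

gpr = GaussianProcessRegressor(normalize_y=True).fit(X_train, y_train)
print("trained on", len(X_train), "of", len(X_pool), "samples")
```

A smaller, diversity-optimized training set also keeps the kernel matrix small, which is the source of the speed and model-size gains mentioned above.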


2021
Author(s): Vu-Linh Nguyen, Mohammad Hossein Shaker, Eyke Hüllermeier

Various strategies for active learning have been proposed in the machine learning literature. In uncertainty sampling, which is among the most popular approaches, the active learner sequentially queries the labels of those instances for which its current prediction is maximally uncertain. The predictions, as well as the measures used to quantify the degree of uncertainty, such as entropy, are traditionally of a probabilistic nature. Yet, alternative approaches to capturing uncertainty in machine learning, along with corresponding uncertainty measures, have been proposed in recent years. In particular, some of these measures seek to distinguish different sources of uncertainty and to separate different types, such as the reducible (epistemic) and the irreducible (aleatoric) parts of the total uncertainty in a prediction. The goal of this paper is to elaborate on the usefulness of such measures for uncertainty sampling and to compare their performance in active learning. To this end, we instantiate uncertainty sampling with different measures, analyze the properties of the sampling strategies thus obtained, and compare them in an experimental study.
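As a reference point for the probabilistic baseline, here is a minimal entropy-based uncertainty sampling sketch with scikit-learn; the epistemic/aleatoric measures studied in the paper would replace the entropy score below, and all data and model choices are illustrative assumptions.

```python
# Minimal entropy-based uncertainty sampling sketch (classification).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
labeled = list(range(20))                      # small initial labeled set
pool = [i for i in range(len(X)) if i not in labeled]

clf = LogisticRegression(max_iter=1000)
for _ in range(10):                            # query 10 labels, one per round
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
    query = pool.pop(int(np.argmax(entropy)))  # most uncertain instance
    labeled.append(query)                      # oracle reveals its label

print("final training size:", len(labeled))
```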


Soil Systems, 2021, Vol 5 (3), pp. 41
Author(s): Tulsi P. Kharel, Amanda J. Ashworth, Phillip R. Owens, Dirk Philipp, Andrew L. Thomas, et al.

Silvopasture systems combine tree and livestock production to minimize market risk and enhance ecological services. Our objective was to explore and develop a method for identifying driving factors linked to productivity in a silvopastoral system using machine learning. A multi-variable approach was used to detect factors that affect system-level output (i.e., plant production (tree and forage), soil factors, and animal response based on grazing preference). Variables from a three-year (2017–2019) grazing study, including forage, tree, soil, and terrain attribute parameters, were analyzed. Hierarchical variable clustering and a random forest model selected 10 important variables for each of four major clusters. Stepwise multiple linear regression and a regression tree approach were used to predict cattle grazing hours per animal unit (h ha⁻¹ AU⁻¹) using 40 variables (10 per cluster) selected from 130 total variables. Overall, the variable ranking method selected more heavily weighted variables for systems-level analysis. The regression tree performed better than stepwise linear regression for interpreting factor-level effects on animal grazing preference. Cattle were more likely to graze forage on soils with Cd levels <0.04 mg kg⁻¹ (126% greater grazing hours per AU), soil Cr <0.098 mg kg⁻¹ (108%), and a SAGA wetness index of <2.7 (57%). Cattle also preferred grazing native grasses (88%) over orchardgrass (Dactylis glomerata L.). These results show that water flow within the landscape position (wetness index) and the associated distribution of metals may be used as indicators of animal grazing preference. Overall, soil nutrient distribution patterns drove grazing response, although animal grazing preference was also influenced by aboveground (forage and tree), soil, and landscape attributes. Machine learning approaches helped explain pasture use and the overall drivers of grazing preference in a multifunctional system.
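The workflow of ranking variables with a random forest and then fitting an interpretable regression tree can be sketched as follows; the column names, data, and thresholds below are hypothetical placeholders, not the study's dataset.

```python
# Hypothetical sketch: rank candidate predictors with a random forest, keep
# the top-ranked ones, then fit an interpretable regression tree for grazing
# hours per animal unit.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.uniform(size=(300, 5)),
                  columns=["soil_Cd", "soil_Cr", "wetness_index",
                           "forage_mass", "tree_density"])
# Toy response: grazing hours driven mainly by soil Cd and wetness index.
grazing_h = 10 - 5 * df["soil_Cd"] + 3 * (df["wetness_index"] < 0.5)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(df, grazing_h)
top = df.columns[np.argsort(rf.feature_importances_)[::-1][:3]]
print("top-ranked variables:", list(top))

tree = DecisionTreeRegressor(max_depth=3).fit(df[top], grazing_h)
print(export_text(tree, feature_names=list(top)))  # readable split thresholds
```

The printed tree exposes explicit split thresholds (e.g., on soil_Cd), which is what makes the regression tree easier to interpret than stepwise linear regression in this setting.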


2021, Vol 11 (9), pp. 3866
Author(s): Jun-Ryeol Park, Hye-Jin Lee, Keun-Hyeok Yang, Jung-Keun Kook, Sanghee Kim

This study aims to predict the compressive strength of concrete using a machine-learning algorithm with linear regression analysis and to evaluate its accuracy. The open-source software library TensorFlow was used to develop the machine-learning algorithm. In the machine-learning algorithm, a total of seven variables were set: water, cement, fly ash, blast furnace slag, sand, coarse aggregate, and coarse aggregate size. A total of 4297 concrete mixtures with measured compressive strengths were employed to train and test the machine-learning algorithm. Of these, 70% were used for training, and 30% were utilized for verification. For verification, the research was conducted by classifying the mixtures into three cases: the case where the machine-learning algorithm was trained using all the data (Case-1), the case where it was trained while maintaining the same number of training samples for each strength range (Case-2), and the case where it was trained after dividing each strength range into subcases (Case-3). The results indicated that the error percentages of Case-1 and Case-2 did not differ significantly, while the error percentage of Case-3 was far smaller than those of Case-1 and Case-2. Therefore, it was concluded that the range of the training dataset of concrete compressive strengths is as important as the amount of training data for accurately predicting concrete compressive strength using a machine-learning algorithm.
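A minimal sketch of the described setup, assuming a single-layer linear model in TensorFlow/Keras over the seven mixture variables with a 70/30 split; the data below is synthetic, not the paper's 4297-mixture database.

```python
# Hedged sketch: linear regression in TensorFlow/Keras over seven mix variables.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(2)
# water, cement, fly ash, slag, sand, coarse aggregate, aggregate size
X = rng.uniform(size=(4297, 7)).astype("float32")
y = (X @ np.array([-8, 12, 3, 4, 1, 2, -1], dtype="float32")
     + rng.normal(0, 1, 4297).astype("float32"))        # stand-in strength (MPa)

n_train = int(0.7 * len(X))                             # 70% train / 30% test
model = tf.keras.Sequential([tf.keras.Input(shape=(7,)),
                             tf.keras.layers.Dense(1)]) # linear regression layer
model.compile(optimizer="adam", loss="mse")
model.fit(X[:n_train], y[:n_train], epochs=50, verbose=0)
print("test MSE:", model.evaluate(X[n_train:], y[n_train:], verbose=0))
```

Case-2 and Case-3 would differ only in how the rows of X and y are stratified before the split, not in the model itself.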


2021, Vol 5 (3), pp. 1-30
Author(s): Gonçalo Jesus, António Casimiro, Anabela Oliveira

Sensor platforms used in environmental monitoring applications are often subject to harsh environmental conditions while monitoring complex phenomena. Therefore, designing dependable monitoring systems is challenging given the external disturbances affecting sensor measurements. Even the apparently simple task of outlier detection in sensor data becomes a hard problem, amplified by the difficulty in distinguishing true data errors due to sensor faults from deviations due to natural phenomena that look like data errors. Existing solutions for runtime outlier detection typically assume that the physical processes can be accurately modeled, or that outliers consist of large deviations that are easily detected and filtered by appropriate thresholds. Other solutions assume that it is possible to deploy multiple sensors providing redundant data to support voting-based techniques. In this article, we propose a new methodology for dependable runtime detection of outliers in environmental monitoring systems, aiming to increase data quality by treating detected outliers. We propose the use of machine learning techniques to model each sensor's behavior, exploiting the existence of correlated data provided by other related sensors. Using these models, along with knowledge of past processed measurements, it is possible to obtain accurate estimations of the observed environmental parameters and build failure detectors that use these estimations. When a failure is detected, these estimations also allow one to correct the erroneous measurements and hence improve the overall data quality. Our methodology not only allows one to distinguish truly abnormal measurements from deviations due to complex natural phenomena, but also allows the quantification of each measurement's quality, which is relevant from a dependability perspective. We apply the methodology to real datasets from a complex aquatic monitoring system, measuring temperature and salinity parameters, through which we illustrate the process for building the machine learning prediction models using a technique based on Artificial Neural Networks, denoted ANNODE (ANN Outlier Detection). From this application, we also observe the effectiveness of our ANNODE approach for accurate outlier detection in harsh environments. We then validate these positive results by comparing ANNODE with state-of-the-art solutions for outlier detection. The results show that ANNODE improves on existing solutions in terms of outlier detection accuracy.
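The core idea, predicting one sensor's reading from correlated neighbours and flagging large residuals, can be sketched as follows; the network size, threshold rule, and data are assumptions for illustration, not the paper's actual ANNODE configuration.

```python
# Illustrative sketch: model a sensor from correlated sensors with a small ANN,
# then flag measurements whose prediction residual exceeds a threshold.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
neighbors = rng.uniform(15, 25, size=(1000, 3))             # correlated sensors
target = neighbors.mean(axis=1) + rng.normal(0, 0.1, 1000)  # monitored sensor
faulty = target.copy()
faulty[::100] += 5.0                                        # inject sensor faults

# Train on fault-free history, then score the (possibly faulty) live stream.
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                     random_state=0).fit(neighbors, target)
residual = np.abs(faulty - model.predict(neighbors))
threshold = 3 * residual[residual < np.quantile(residual, 0.95)].std()
outliers = np.where(residual > threshold)[0]
print("flagged indices:", outliers[:10])
```

Replacing a flagged reading with the model's prediction is one simple way to realize the correction step described above, and the residual magnitude itself can serve as a per-measurement quality score.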


2021
Author(s): Lun Ai, Stephen H. Muggleton, Céline Hocquette, Mark Gromowski, Ute Schmid

Given the recent successes of deep learning in AI, there has been increased interest in the role of, and need for, explanations in machine-learned theories. A distinct notion in this context is Michie's definition of ultra-strong machine learning (USML). USML is demonstrated by a measurable increase in human performance on a task following provision to the human of a symbolic machine-learned theory for task performance. A recent paper demonstrated the beneficial effect of a machine-learned logic theory for a classification task, yet no existing work to our knowledge has examined the potential harmfulness of a machine's involvement for human comprehension during learning. This paper investigates the explanatory effects of a machine-learned theory in the context of simple two-person games and proposes a framework for identifying the harmfulness of machine explanations based on the cognitive science literature. The approach involves a cognitive window consisting of two quantifiable bounds and is supported by empirical evidence collected from human trials. Our quantitative and qualitative results indicate that human learning aided by a symbolic machine-learned theory which satisfies a cognitive window achieves significantly higher performance than human self-learning. Results also demonstrate that human learning aided by a symbolic machine-learned theory that fails to satisfy this window leads to significantly worse performance than unaided human learning.

