Using Active Learning to Develop Machine Learning Models for Reaction Yield Prediction

Author(s):  
Hampus Gummesson Svensson ◽  
Simon Viet Johansson ◽  
Esben Bjerrum ◽  
Alexander Schliep ◽  
Morteza Haghir Chehreghani ◽  
...  
Sensors ◽  
2020 ◽  
Vol 20 (7) ◽  
pp. 1932
Author(s):  
Ramyar Saeedi ◽  
Keyvan Sasani ◽  
Assefaw H. Gebremedhin

Mobile health monitoring plays a central role in the future of cyber-physical systems (CPS) for healthcare applications. Such monitoring systems need to process user data accurately. Unlike in other human-centered CPS, in healthcare CPS the user functions in multiple roles at the same time: as an operator, an actuator, the physical environment and, most importantly, the target that needs to be monitored in the process. Mobile health CPS devices therefore generally face highly dynamic settings, and the accuracy of the machine learning models the devices employ may drop dramatically every time the setting changes. Novel learning architectures that specifically address the challenges associated with dynamic environments are therefore needed. Using active learning and transfer learning as organizing principles, we propose a collaborative multiple-expert architecture and accompanying algorithms for the design of machine learning models that autonomously adapt to a new configuration, context, or user need. Specifically, our architecture and its constituent algorithms are designed to manage heterogeneous knowledge sources or experts with varying levels of confidence and type while minimizing adaptation cost. Additionally, our framework incorporates a mechanism for collaboration among experts to enrich their knowledge, which in turn decreases both the cost and the uncertainty of data labeling in future steps. We evaluate the efficacy of the architecture using two publicly available human activity datasets. We attain activity recognition accuracy of over 85% (for the first dataset) and 92% (for the second dataset) by labeling only 15% of the unlabeled data.
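The abstract does not spell out its algorithms, so the following is only a minimal sketch of one way experts with different confidence levels could be combined and a labeling budget spent on the least certain samples. The function names, the confidence-weighted vote, and the default 15% budget fraction are illustrative assumptions, not the authors' method.

```python
from collections import defaultdict


def weighted_vote(expert_votes):
    """Combine (label, confidence) pairs from several experts.

    Returns the label with the largest summed confidence and that label's
    share of the total confidence (1.0 means the experts fully agree).
    Confidences are assumed to be positive numbers.
    """
    totals = defaultdict(float)
    for label, confidence in expert_votes:
        totals[label] += confidence
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())


def select_queries(pool_votes, budget_fraction=0.15):
    """Pick the samples whose combined vote is least certain.

    pool_votes holds one list of (label, confidence) pairs per unlabeled
    sample. The returned indices are the samples to send for manual labeling;
    the budget is a fraction of the pool size (hypothetical default: 15%).
    """
    budget = max(1, round(budget_fraction * len(pool_votes)))
    scored = sorted((weighted_vote(votes)[1], i) for i, votes in enumerate(pool_votes))
    return sorted(i for _, i in scored[:budget])


if __name__ == "__main__":
    # Hypothetical activity labels from two experts per sample.
    pool = [
        [("walk", 0.9), ("walk", 0.8)],
        [("walk", 0.6), ("run", 0.5)],
        [("sit", 0.7), ("sit", 0.9)],
        [("sit", 0.5), ("stand", 0.5)],
    ]
    print(select_queries(pool, budget_fraction=0.5))
```

Samples that are not selected would keep the label from `weighted_vote`, so manual effort goes only to the cases where the experts disagree most.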


2020 ◽  
Vol 6 (1) ◽  
Author(s):  
Ye Sheng ◽  
Yasong Wu ◽  
Jiong Yang ◽  
Wencong Lu ◽  
Pierre Villars ◽  
...  

The Materials Genome Initiative requires the crossing of material calculations, machine learning, and experiments to accelerate the material development process. In recent years, data-based methods have been applied to the thermoelectric field, mostly on the transport properties. In this work, we combined data-driven machine learning and first-principles automated calculations into an active learning loop in order to predict the p-type power factors (PFs) of diamond-like pnictides and chalcogenides. Our active learning loop contains two procedures: (1) based on a high-throughput theoretical database, machine learning methods are employed to select potential candidates, and (2) the transport properties of these candidates are verified computationally. The verification data are then added to the database to improve the extrapolation abilities of the machine learning models. Different strategies for selecting candidates were tested; the Gradient Boosting Regression model with the Query by Committee strategy achieved the highest extrapolation accuracy (Pearson R = 0.95 on untrained systems). Based on the prediction from the machine learning models, binary pnictides, vacancy, and small atom-containing chalcogenides are predicted to have large PFs. The bonding analysis reveals that the alterations of anionic bonding networks due to small atoms are beneficial to the PFs in these compounds.
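The two-step loop described above (select candidates by committee disagreement, verify them, add the results back to the training data) can be sketched as follows. This is a simplified illustration under stated assumptions: a committee of bootstrap-trained one-feature linear fits stands in for the Gradient Boosting Regression models, and the hypothetical `oracle` callback stands in for the first-principles verification step.

```python
import random
from statistics import mean, pvariance


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: fall back to a constant model
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def train_committee(xs, ys, n_members=5, seed=0):
    """Train each committee member on a bootstrap resample of the data."""
    rng = random.Random(seed)
    committee = []
    for _ in range(n_members):
        idx = [rng.randrange(len(xs)) for _ in xs]
        committee.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return committee


def disagreement(committee, x):
    """Variance of the members' predictions at x."""
    return pvariance([a * x + b for a, b in committee])


def query_by_committee(committee, candidates, k=1):
    """Return the k candidates on which the committee disagrees most."""
    ranked = sorted(candidates, key=lambda x: disagreement(committee, x), reverse=True)
    return ranked[:k]


def active_learning_loop(xs, ys, candidates, oracle, rounds=3):
    """Repeat: train committee, pick one candidate, verify it, add it to the data."""
    xs, ys, candidates = list(xs), list(ys), list(candidates)
    for _ in range(rounds):
        if not candidates:
            break
        committee = train_committee(xs, ys)
        (pick,) = query_by_committee(committee, candidates, k=1)
        candidates.remove(pick)
        xs.append(pick)
        ys.append(oracle(pick))  # stand-in for the computational verification
    return xs, ys
```

Ranking by prediction variance is one common disagreement measure for regression committees; the abstract does not say which measure the authors used.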


2019 ◽  
Vol 10 (35) ◽  
pp. 8154-8163 ◽  
Author(s):  
Yao Zhang ◽  
Alpha A. Lee

We report a statistically principled method to quantify the uncertainty of machine learning models for molecular property prediction. We show that this uncertainty estimate can be used to judiciously design experiments.


2021 ◽  
Author(s):  
Amit Kumar Srivast ◽  
Nima Safaei ◽  
Saeed Khaki ◽  
Gina Lopez ◽  
Wenzhi Zeng ◽  
...  

Crop yield forecasting depends on many interactive factors including crop genotype, weather, soil, and management practices. This study analyzes the performance of machine learning and deep learning methods for winter wheat yield prediction using extensive datasets of weather, soil, and crop phenology. We propose a convolutional neural network (CNN) which uses the 1-dimensional convolution operation to capture the time dependencies of environmental variables. The proposed CNN, evaluated along with other machine learning models for winter wheat yield prediction in Germany, outperformed all other models tested. To address seasonality, weekly features were used that explicitly take soil moisture and meteorological events into account. Our results indicated that nonlinear models such as deep learning models and XGBoost are more effective than linear models in finding the functional relationship between the crop yield and input data, and that deep neural networks had a higher prediction accuracy than XGBoost. One of the main limitations of machine learning models is their black box property. Therefore, we moved beyond prediction and performed feature selection, as it provides key results towards explaining yield prediction (variable importance by time). As such, our study indicates which variables have the most significant effect on winter wheat yield.
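To illustrate what a 1-dimensional convolution does to a weekly time series, here is a minimal sketch of a single filter in plain Python. The kernel values and the sample series are made up for illustration, and the authors' actual network layout is not given in the abstract.

```python
def conv1d(series, kernel, bias=0.0):
    """Valid-mode 1-D convolution over a time series.

    As in most deep learning libraries, the kernel is applied without
    flipping (strictly speaking, a cross-correlation). Each output value
    summarizes len(kernel) consecutive time steps.
    """
    k = len(kernel)
    if k == 0 or k > len(series):
        raise ValueError("kernel must be non-empty and no longer than the series")
    return [
        sum(w * series[t + j] for j, w in enumerate(kernel)) + bias
        for t in range(len(series) - k + 1)
    ]


def relu(values):
    """Rectified linear activation, applied element-wise."""
    return [max(0.0, v) for v in values]


if __name__ == "__main__":
    # Hypothetical weekly soil-moisture readings and a two-week averaging kernel.
    weekly_soil_moisture = [1.0, 3.0, 2.0, 6.0]
    print(relu(conv1d(weekly_soil_moisture, [0.5, 0.5])))
```

In a trained CNN the kernel weights are learned rather than fixed, and many such filters run in parallel over several environmental variables.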


2021 ◽  
Vol 13 (16) ◽  
pp. 3322
Author(s):  
Dan Li ◽  
Yuxin Miao ◽  
Sanjay K. Gupta ◽  
Carl J. Rosen ◽  
Fei Yuan ◽  
...  

Accurate high-resolution yield maps are essential for identifying spatial yield variability patterns, determining key factors influencing yield variability, and providing site-specific management insights in precision agriculture. Cultivar differences can significantly influence potato (Solanum tuberosum L.) tuber yield prediction using remote sensing technologies. The objective of this study was to improve potato yield prediction using unmanned aerial vehicle (UAV) remote sensing by incorporating cultivar information with machine learning methods. Small plot experiments involving different cultivars and nitrogen (N) rates were conducted in 2018 and 2019. UAV-based multi-spectral images were collected throughout the growing season. Machine learning models, i.e., random forest regression (RFR) and support vector regression (SVR), were used to combine different vegetation indices with cultivar information. It was found that UAV-based spectral data from the early growing season at the tuber initiation stage (late June) were more strongly correlated with potato marketable yield than the spectral data from the later growing season at the tuber maturation stage. However, the best performing vegetation indices and the best timing for potato yield prediction varied with cultivars. The performance of the RFR and SVR models using only remote sensing data was unsatisfactory (R² = 0.48–0.51 for validation) but was significantly improved when cultivar information was incorporated (R² = 0.75–0.79 for validation). It is concluded that combining high spatial-resolution UAV images and cultivar information using machine learning algorithms can significantly improve potato yield prediction compared with methods that do not use cultivar information. More studies are needed to improve potato yield prediction using more detailed cultivar information, soil and landscape variables, and management information, as well as more advanced machine learning models.
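The abstract does not say how cultivar information was encoded for the RFR and SVR models. One common option, shown here purely as an assumption, is to append a one-hot cultivar indicator to each plot's vegetation-index features; the cultivar names and index values below are hypothetical.

```python
def add_cultivar_features(spectral_rows, cultivars):
    """Append a one-hot cultivar encoding to each row of vegetation indices.

    spectral_rows: one list of vegetation-index values per plot.
    cultivars: the cultivar name for each plot, in the same order.
    Returns the extended feature rows and the ordered list of cultivar levels.
    """
    if len(spectral_rows) != len(cultivars):
        raise ValueError("need exactly one cultivar per spectral row")
    levels = sorted(set(cultivars))
    extended = []
    for row, cultivar in zip(spectral_rows, cultivars):
        one_hot = [1.0 if cultivar == level else 0.0 for level in levels]
        extended.append(list(row) + one_hot)
    return extended, levels


if __name__ == "__main__":
    # Hypothetical vegetation indices for three plots of two cultivars.
    rows = [[0.6, 0.3], [0.7, 0.4], [0.5, 0.2]]
    features, levels = add_cultivar_features(rows, ["cultivar_B", "cultivar_A", "cultivar_B"])
    print(levels)
    print(features)
```

The extended rows can then be passed to any regression model in place of the spectral-only features.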


10.2196/17984 ◽  
2020 ◽  
Vol 8 (3) ◽  
pp. e17984 ◽  
Author(s):  
Irena Spasic ◽  
Goran Nenadic

Background: Clinical narratives represent the main form of communication within health care, providing a personalized account of patient history and assessments, and offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its feasibility to unlock evidence buried in clinical narratives. Machine learning can facilitate rapid development of NLP tools by leveraging large amounts of text data.
Objective: The main aim of this study was to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigated the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice.
Methods: Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multifaceted interface, to perform a literature search against MEDLINE. We identified 110 relevant studies and extracted information about text data used to support machine learning, NLP tasks supported, and their clinical applications. The data properties considered included their size, provenance, collection methods, annotation, and any relevant statistics.
Results: The majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents, with a handful of studies utilizing more. Relatively small datasets were utilized for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning was explored to iteratively sample a subset of data for manual annotation as a strategy for minimizing the annotation effort while maximizing the predictive performance of the model. Supervised learning was successfully used where clinical codes integrated with free-text notes into electronic health records were utilized as class labels. Similarly, distant supervision was used to utilize an existing knowledge base to automatically annotate raw text. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable because of the sensitive nature of data considered. Besides the small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The majority of studies focused on text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management, and surveillance.
Conclusions: We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as a way of saving the annotation efforts. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or unsupervised learning, which do not require data annotation.


2021 ◽  
Vol 12 ◽  
Author(s):  
Qiyu Zhou ◽  
Douglas J. Soldat

Nitrogen is the most limiting nutrient for turfgrass growth. Instead of pursuing the maximum yield, most turfgrass managers use nitrogen (N) to maintain a sub-maximal growth rate. Few tools or soil tests exist to help managers guide N fertilizer decisions. Turf growth prediction models have the potential to be useful, but the existing turf growth prediction model takes only temperature into account, limiting its accuracy. This study developed machine-learning-based turf growth models using the random forest (RF) algorithm to estimate short-term turfgrass clipping yield. To build the RF model, a large set of variables was extracted as predictors, including the 7-day weather, traffic intensity, soil moisture content, N fertilization rate, and the normalized difference red edge (NDRE) vegetation index. The data were collected from two putting greens where the turfgrass received traffic rates of 0 to 1,800 rounds/week, various irrigation rates to maintain the soil moisture content between 9 and 29%, and N fertilization rates of 0 to 17.5 kg ha⁻¹ applied biweekly. The RF model's predictions agreed with the clipping yield measured in the experiments. The temperature and relative humidity were the most important weather factors. Including NDRE improved the prediction accuracy of the model. The highest coefficient of determination (R²) of the RF model was 0.64 for the training dataset and 0.47 for the testing dataset. This represented a large improvement over the existing growth prediction model (R² = 0.01). However, the machine-learning models created were not able to accurately predict the clipping production at other locations. Individual golf courses can create customized growth prediction models using clipping volume to eliminate the deviation caused by temporal and spatial variability. Overall, this study demonstrated the feasibility of creating machine-learning-based yield prediction models that may be able to guide N fertilization decisions on golf course putting greens and presumably other turfgrass areas.
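For readers comparing the R² values reported above, the sketch below computes the coefficient of determination in its common form, 1 − SS_res / SS_tot. This is an assumption about the definition: the abstract does not state whether the authors used this form or the squared correlation, and the two can differ on held-out data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    A value of 1.0 means perfect predictions, 0.0 means no better than always
    predicting the mean of the observations, and negative values mean worse
    than that.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observations are identical")
    return 1.0 - ss_res / ss_tot
```

Computing this separately on the training and testing data is what exposes a gap like the one reported (0.64 versus 0.47), which suggests the model fits the data it was trained on better than it generalizes.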

