Tuning machine learning dropout for subsurface uncertainty model accuracy

Author(s):  
Eduardo Maldonado-Cruz ◽  
Michael J. Pyrcz
2021 ◽  
Vol 21 (1) ◽  
Author(s):  
William Greig Mitchell ◽  
Edward Christopher Dee ◽  
Leo Anthony Celi

Abstract Cho et al. report deep learning model accuracy for tilted myopic disc detection in a South Korean population. Here we explore the importance of generalisability of machine learning (ML) in healthcare, and we emphasise that recurrent underrepresentation of data-poor regions may inadvertently perpetuate global health inequity. Creating meaningful ML systems is contingent on understanding how, when, and why different ML models work in different settings. While we echo the need for the diversification of ML datasets, such a worthy effort would take time and does not obviate uses of presently available datasets if conclusions are validated and re-calibrated for different groups prior to implementation. The importance of external ML model validation on diverse populations should be highlighted where possible, especially for models built with single-centre data.


2019 ◽  
Vol 3 (Supplement_1) ◽  
Author(s):  
Leila Shinn ◽  
Yutong Li ◽  
Ruoqing Zhu ◽  
Aditya Mansharamani ◽  
Loretta Auvil ◽  
...  

Abstract. Objectives: To better understand host-microbe interactions, a more computationally intensive, multivariate, machine learning approach must be utilized. Accordingly, we aimed to identify biomarkers with high predictive accuracy for dietary intake. Methods: Data were aggregated from five randomized, controlled, feeding studies in adults (n = 199) that provided avocados, almonds, broccoli, walnuts, or whole grain oats and whole grain barley. Fecal samples were collected during treatment and control periods for each study for DNA extraction. Subsequently, the 16S rRNA gene (V4 region) was amplified and sequenced. Sequence data were analyzed using DADA2 and QIIME2. Marginal screening using the Kruskal-Wallis test was performed on all species-level taxa to examine the differences between each of the 6 treatment groups and respective control groups. The top 20 species from each diet were selected and pooled together for multiclass classification using random forest. The resultant set of bacterial species was then reduced in a stepwise fashion and iteratively re-analyzed, using the variable importance generated by the random forest, to determine a compact feature set with only a minor loss of accuracy in the prediction of food consumed. Results: When all six foods were analyzed together using the top 20 species of each diet, oats and barley were frequently confused for each other, with 44% and 47% classification error, respectively, and the overall model accuracy was 66%. Collapsing oats and barley into one category, whole grains, reduced the classification error of the whole grain category to 6% and improved the overall model accuracy to 73%. Refitting the random forest with the top 30, 20, and 10 important species resulted in correct identification of the 5 foods (avocados, almonds, broccoli, walnuts, and whole grains) 75%, 74%, and 70% of the time, respectively.
Conclusions: These results reveal promise in accurately predicting foods consumed using bacterial species as biomarkers. Ongoing analyses include incorporation of metagenomic and metabolomic data into the models to improve predictive accuracy and utilize the multi-omics dataset to predict health status. Long-term, these approaches may inform diet-microbiota-tailored recommendations. Funding Sources: This research was funded by The Foundation for Food and Agriculture Research, USDA, Hass Avocado Board, and USDA National Institute of Food and Agriculture, Hatch project 1009249.
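
The effect of collapsing two frequently confused classes can be illustrated with a small confusion-matrix calculation. The sketch below is not the study's analysis: the counts are hypothetical, the three-class setup is a simplification, and the helper names (`per_class_error`, `overall_accuracy`, `collapse`) are made up for illustration.

```python
from collections import defaultdict


def per_class_error(confusion):
    """Return {true_label: fraction of that class predicted as something else}."""
    errors = {}
    for true_label, row in confusion.items():
        total = sum(row.values())
        errors[true_label] = 1 - row.get(true_label, 0) / total
    return errors


def overall_accuracy(confusion):
    """Fraction of all samples whose predicted label equals the true label."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row.get(label, 0) for label, row in confusion.items())
    return correct / total


def collapse(confusion, members, merged):
    """Merge every label in `members` into the single label `merged`."""
    def rename(label):
        return merged if label in members else label

    out = defaultdict(lambda: defaultdict(int))
    for true_label, row in confusion.items():
        for pred_label, count in row.items():
            out[rename(true_label)][rename(pred_label)] += count
    return {t: dict(row) for t, row in out.items()}


# Hypothetical counts (confusion[true][predicted]); NOT the study's data.
confusion = {
    "oats": {"oats": 11, "barley": 8, "walnuts": 1},
    "barley": {"oats": 9, "barley": 10, "walnuts": 1},
    "walnuts": {"oats": 1, "barley": 1, "walnuts": 18},
}
merged = collapse(confusion, {"oats", "barley"}, "whole grains")

print(per_class_error(confusion))
print(overall_accuracy(confusion), overall_accuracy(merged))
```

Because oats-versus-barley mistakes become correct predictions once the two labels are merged, both the per-class error of the merged category and the overall accuracy improve without refitting the classifier.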


2020 ◽  
Vol 34 (01) ◽  
pp. 784-791 ◽  
Author(s):  
Qinbin Li ◽  
Zhaomin Wu ◽  
Zeyi Wen ◽  
Bingsheng He

The Gradient Boosting Decision Tree (GBDT) has been a popular machine learning model for various tasks in recent years. In this paper, we study how to improve the model accuracy of GBDT while preserving the strong guarantee of differential privacy. Sensitivity and privacy budget are two key design aspects for the effectiveness of differentially private models. Existing solutions for GBDT with differential privacy suffer from significant accuracy loss due to overly loose sensitivity bounds and ineffective privacy budget allocations (especially across different trees in the GBDT model). Loose sensitivity bounds require more noise to reach a fixed privacy level. Ineffective privacy budget allocations worsen the accuracy loss, especially when the number of trees is large. Therefore, we propose a new GBDT training algorithm that achieves tighter sensitivity bounds and more effective noise allocations. Specifically, by investigating the property of the gradient and the contribution of each tree in GBDTs, we propose to adaptively control the gradients of the training data in each iteration and to clip leaf nodes in order to tighten the sensitivity bounds. Furthermore, we design a novel boosting framework to allocate the privacy budget between trees so that the accuracy loss can be further reduced. Our experiments show that our approach can achieve much better model accuracy than other baselines.
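
The link between sensitivity, privacy budget and noise can be sketched with the standard Laplace mechanism. This is not the paper's algorithm: it assumes neighbouring datasets differ by adding or removing one record (so the sum of gradients clipped to `[-bound, bound]` has sensitivity `bound`), and it splits the budget evenly across trees as a naive baseline. All function names and numbers are hypothetical.

```python
import random


def clip(values, bound):
    """Clip each value to [-bound, bound] so one record's influence is bounded."""
    return [max(-bound, min(bound, v)) for v in values]


def laplace_scale(sensitivity, epsilon):
    """Scale of the Laplace noise needed for epsilon-differential privacy."""
    return sensitivity / epsilon


def per_tree_budget(total_epsilon, n_trees):
    """Naive even split of the budget under sequential composition."""
    return total_epsilon / n_trees


def laplace_noise(scale, rng):
    """The difference of two exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_gradient_sum(gradients, bound, epsilon, rng):
    """Sum of clipped gradients plus Laplace noise calibrated to `bound`."""
    return sum(clip(gradients, bound)) + laplace_noise(laplace_scale(bound, epsilon), rng)


rng = random.Random(0)
gradients = [-3.0, 0.5, 4.0, -0.2]  # hypothetical per-record gradients

# A tighter bound means a smaller noise scale at the same epsilon.
print(laplace_scale(1.0, 0.5), laplace_scale(4.0, 0.5))

# With an even split, more trees leave less budget (hence more noise) per tree.
for n_trees in (10, 50):
    eps = per_tree_budget(1.0, n_trees)
    print(n_trees, laplace_scale(1.0, eps), noisy_gradient_sum(gradients, 1.0, eps, rng))
```

The noise scale grows linearly with both the sensitivity bound and the number of trees sharing the budget, which is why tighter bounds and a better allocation both translate into less accuracy loss.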


2021 ◽  
Vol 9 ◽  
Author(s):  
Geetha Mani ◽  
Joshi Kumar Viswanadhapalli ◽  
Albert Alexander Stonie ◽  
...  

Air is one of the most fundamental constituents for the sustenance of life on earth. Meteorological and traffic factors, consumption of non-renewable energy sources, and industrial parameters are steadily increasing air pollution. These factors affect the welfare and prosperity of life on earth; therefore, the air quality in our environment needs to be monitored continuously. The Air Quality Index (AQI), which indicates air quality, is influenced by several individual factors such as the accumulation of NO2, CO, O3, PM2.5, SO2, and PM10. This research paper aims to predict and forecast the AQI with Machine Learning (ML) techniques, namely linear regression and time series analysis. First, a Multiple Linear Regression (MLR) model, a supervised machine learning method, is developed to predict AQI. NO2, ozone (O3), PM2.5, and SO2 sensor outputs collected from the Central Pollution Control Board (CPCB), Chennai region, India, serve as input features, and the optimized AQI calculated from the sensor outputs is set as the target to train the regression model. The obtained model parameters are validated with new and unseen sensor output. Key Performance Indices (KPIs) such as the coefficient of determination, root mean square error and mean absolute error were calculated to validate the model accuracy. The k-fold cross-validation score of the MLR model on the testing data was around 92%. Second, the Auto-Regressive Integrated Moving Average (ARIMA) time series model is applied to forecast the AQI. The obtained model parameters were validated with unseen, timestamped data. The forecasted AQI values for the next 15 days lie within the 95% confidence interval. The model accuracy on the test data was more than 80%.
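
The three KPIs named in the abstract have standard definitions, sketched below in plain Python. The AQI values are hypothetical and only illustrate the arithmetic; they are not taken from the CPCB data.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical observed and predicted AQI values, for illustration only.
aqi_true = [50, 60, 70, 80]
aqi_pred = [52, 58, 73, 77]

print(mean_absolute_error(aqi_true, aqi_pred))
print(root_mean_square_error(aqi_true, aqi_pred))
print(r_squared(aqi_true, aqi_pred))
```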


2019 ◽  
Vol 9 (7) ◽  
pp. 1459 ◽  
Author(s):  
Huihui Mao ◽  
Jihua Meng ◽  
Fujiang Ji ◽  
Qiankun Zhang ◽  
Huiting Fang

Leaf area index (LAI) is a crucial crop biophysical parameter that has been widely used in a variety of fields. Five state-of-the-art machine learning regression algorithms (MLRAs), namely, artificial neural network (ANN), support vector regression (SVR), Gaussian process regression (GPR), random forest (RF) and gradient boosting regression tree (GBRT), have been used in the retrieval of cotton LAI with Sentinel-2 spectral bands. The performances of the five machine learning models are compared for better applications of MLRAs in remote sensing, since challenging problems remain in the selection of MLRAs for crop LAI retrieval, as well as in deciding the optimal training sample size and number of spectral bands for different MLRAs. A comprehensive evaluation was employed with respect to model accuracy, computational efficiency, sensitivity to training sample size and sensitivity to spectral bands. We conducted the comparison of five MLRAs in an agricultural area of Northwest China over three cotton seasons with the corresponding field campaigns for modeling and validation. Results show that the GBRT model outperforms the other models with respect to average model accuracy ($\overline{R^2} = 0.854$, $\overline{\mathrm{RMSE}} = 0.674$ and $\overline{\mathrm{MAE}} = 0.456$). SVR achieves the best performance in computational efficiency, which means it is fast to train and validate and has great potential to deliver near-real-time operational products for crop management. As for sensitivity to training sample size, GBRT behaves as the most robust model, and provides the best average model accuracy across the variations of training sample size, compared with other models ($\overline{R^2} = 0.884$, $\overline{\mathrm{RMSE}} = 0.615$ and $\overline{\mathrm{MAE}} = 0.452$).
Spectral bands sensitivity analysis with dCor (distance correlation), combined with the backward elimination approach, indicates that SVR, GPR and RF provide relatively robust performance with respect to the spectral bands, while ANN outperforms the other models in terms of average model accuracy as the number of spectral bands is reduced ($\overline{R^2} = 0.881$, $\overline{\mathrm{RMSE}} = 0.625$ and $\overline{\mathrm{MAE}} = 0.480$). A comprehensive evaluation indicates that GBRT is an appealing alternative for cotton LAI retrieval, except for its computational efficiency. Despite the different performance of the ML models, all models exhibited considerable potential for cotton LAI retrieval, which could offer timely and accurate crop parameter information for crop field management and agricultural production decisions.
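
Distance correlation can be computed from double-centred pairwise distance matrices. The sketch below implements the usual sample estimate for one-dimensional variables and uses it to pick the band least associated with LAI, which is one plausible way to choose what a backward elimination step drops. It is a simplification rather than the paper's procedure; the band names and all values are hypothetical.

```python
import math


def _centered_distances(values):
    """Pairwise |a - b| distances, double-centred by row, column and grand mean."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in d]  # equals column means (d is symmetric)
    grand_mean = sum(row_mean) / n
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def _dcov2(a, b):
    """Squared sample distance covariance from two centred distance matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation in [0, 1]; 0.0 if either input is constant."""
    a, b = _centered_distances(x), _centered_distances(y)
    dvar_x, dvar_y = _dcov2(a, a), _dcov2(b, b)
    if dvar_x == 0 or dvar_y == 0:
        return 0.0
    return math.sqrt(max(_dcov2(a, b), 0.0) / math.sqrt(dvar_x * dvar_y))


def weakest_band(bands, target):
    """Name of the band with the lowest distance correlation to the target."""
    return min(bands, key=lambda name: distance_correlation(bands[name], target))


# Hypothetical reflectances and LAI values, for illustration only.
lai = [1.0, 2.0, 3.0, 4.0, 5.0]
bands = {
    "B8": [0.20, 0.30, 0.40, 0.50, 0.60],  # varies linearly with LAI
    "B4": [0.30, 0.10, 0.35, 0.12, 0.31],  # little relation to LAI
}

for name, values in bands.items():
    print(name, distance_correlation(values, lai))
print("drop first:", weakest_band(bands, lai))
```

Unlike Pearson correlation, distance correlation also responds to non-linear dependence, which is a common reason to prefer it for ranking spectral bands.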


2021 ◽  
pp. 517-530
Author(s):  
Juily Vasandani ◽  
Saumya Bharti ◽  
Deepankar Singh ◽  
Shreeansh Priyadarshi

2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Camelia Berghian-Grosan ◽  
Dana Alina Magdas

Abstract In this pilot study, the association between Raman spectroscopy and Machine Learning algorithms was used for the first time for the differentiation of distillates with respect to trademark, geographical origin and botanical origin. Two Raman spectral ranges (region I: 200–600 cm⁻¹; region II: 1200–1400 cm⁻¹) appeared to have the highest discrimination potential for the investigated distillates. The proposed approach proved very effective for trademark fingerprint differentiation, achieving a model accuracy of 95.5% (only one sample was misclassified). A comparable model accuracy (90.9%) was achieved for the geographical discrimination of the fruit spirits, which can be considered very good given that this classification was made within the Transylvania region, among neighbouring areas. Because the trademark fingerprint is the prevailing one, successful differentiation of distillate type with respect to fruit variety was possible only within each producing entity.


Breast cancer, which develops in breast tissue, is one of the most dangerous cancers leading to death in women. In this work, a Deep Neural Network (DNN) model is implemented on the AWS machine learning platform and compared with other ML techniques, including XGBoost and Random Forest, on a public dataset. Breast cancer prediction based on the DNN model with hyperparameter tuning gives the best results, both in the plot of model accuracy for the training and validation sets and in the performance evaluation metrics used to test the model.


2021 ◽  
Vol 12 (1) ◽  
pp. 1-17
Author(s):  
Swati V. Narwane ◽  
Sudhir D. Sawarkar

Class imbalance is a major hurdle for machine learning-based systems. The data set is the backbone of machine learning and must be studied to handle class imbalance. The purpose of this paper is to investigate the effect of class imbalance on data sets. The proposed methodology determines the model accuracy for a given class distribution. To find possible solutions, the behaviour of an imbalanced data set was investigated. The study considers two case studies with data sets whose class distributions range from balanced to unbalanced. Standard machine learning algorithms were evaluated on the training and test data. Model accuracy for each class distribution was measured with the training data set. Further, the built model was tested on each binary class individually. Results show that, to improve system performance, it is essential to address class imbalance. The study concludes that the system produces results biased towards the majority class. In the future, the multiclass imbalance problem can be studied using advanced algorithms.
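
The bias towards the majority class is easy to reproduce with a toy example. The sketch below is not the paper's experiment: it assumes a hypothetical 90/10 binary data set and a trivial model that always predicts the majority class, then compares plain accuracy with per-class recall.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_per_class(y_true, y_pred):
    """For each class, the fraction of its samples that were predicted correctly."""
    recalls = {}
    for label in sorted(set(y_true)):
        hits = [p == label for t, p in zip(y_true, y_pred) if t == label]
        recalls[label] = sum(hits) / len(hits)
    return recalls


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = recall_per_class(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical imbalanced labels: 90 majority (0) and 10 minority (1) samples.
y_true = [0] * 90 + [1] * 10
# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))
print(recall_per_class(y_true, y_pred))
print(balanced_accuracy(y_true, y_pred))
```

Plain accuracy looks high even though no minority sample is ever identified, which is why testing each class individually exposes a problem that a single accuracy figure hides.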

