A study of real-world micrograph data quality and machine learning model robustness

2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Xiaoting Zhong ◽  
Brian Gallagher ◽  
Keenan Eves ◽  
Emily Robertson ◽  
T. Nathan Mundhenk ◽  
...  

Abstract Machine-learning (ML) techniques hold the potential to enable efficient quantitative micrograph analysis, but the robustness of ML models with respect to real-world micrograph quality variations has not been carefully evaluated. We collected thousands of scanning electron microscopy (SEM) micrographs of molecular solid materials, in which image pixel intensities vary due to both the microstructure content and the microscope instrument conditions. We then built ML models to predict the ultimate compressive strength (UCS) of consolidated molecular solids, both by encoding micrographs with different image feature descriptors and training a random forest regressor, and by training an end-to-end deep-learning (DL) model. Results show that instrument-induced pixel intensity signals can affect ML model predictions in a consistently negative way. As a remedy, we explored intensity normalization techniques. We find that intensity normalization helps to improve micrograph data quality and ML model robustness, but microscope-induced intensity variations can be difficult to eliminate.
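As one concrete (and deliberately simple) stand-in for the intensity normalization techniques mentioned above, per-image z-score normalization removes affine brightness/contrast shifts. The simulated "instrument settings" below are illustrative assumptions, not the paper's data or exact method:

```python
import numpy as np

def zscore_normalize(img: np.ndarray) -> np.ndarray:
    """Per-image z-score normalization: zero mean, unit variance.

    This removes global brightness/contrast offsets (one proxy for
    instrument-induced intensity shifts) but cannot undo spatially
    varying instrument effects.
    """
    img = img.astype(np.float64)
    std = img.std()
    if std == 0:  # constant image: nothing to rescale
        return img - img.mean()
    return (img - img.mean()) / std

# Two simulated micrographs of the same microstructure, captured under
# different (hypothetical) instrument brightness/contrast settings.
rng = np.random.default_rng(0)
structure = rng.random((64, 64))
img_a = 1.0 * structure + 0.1   # baseline settings
img_b = 2.5 * structure + 40.0  # brighter, higher-contrast settings

norm_a, norm_b = zscore_normalize(img_a), zscore_normalize(img_b)
# After normalization the two captures coincide, because the simulated
# instrument effect here is a pure affine intensity change.
print(np.allclose(norm_a, norm_b))  # True
```

Real instrument effects are rarely a clean affine shift, which is consistent with the paper's finding that normalization helps but does not fully eliminate microscope-induced variation.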

Author(s):  
Xianping Du ◽  
Onur Bilgen ◽  
Hongyi Xu

Abstract Machine learning for classification has been widely used in engineering design, for example, for feasible domain recognition and hidden pattern discovery. Training an accurate machine learning model requires a large dataset; however, high computational or experimental costs are major obstacles to obtaining a large dataset for real-world problems. One possible solution is to generate a large pseudo dataset with surrogate models that are established from a smaller set of real training data. However, it is not well understood whether the pseudo dataset benefits the classification model by providing more information or degrades machine learning performance due to the prediction errors and uncertainties introduced by the surrogate model. This paper presents a preliminary investigation of this research question. A classification-and-regression tree (CART) model is employed to recognize the design subspaces that support design decision-making. It is implemented on the geometric design of a vehicle energy-absorbing structure based on finite element simulations. Based on a small set of real-world data obtained by simulations, a surrogate model based on Gaussian process regression is employed to generate pseudo datasets for training. The results show that the tree-based method can help recognize feasible design domains efficiently. Furthermore, the additional information provided by the surrogate model enhances classification accuracy. One important conclusion is that the accuracy of the surrogate model determines the quality of the pseudo dataset and, hence, the improvement in the machine learning model.
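The workflow above can be sketched end to end on a toy one-dimensional design problem. The `expensive_simulation` function, sample sizes, and feasibility threshold are all illustrative assumptions, and scikit-learn's `GaussianProcessRegressor` and `DecisionTreeClassifier` stand in for the paper's surrogate and CART models:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Hypothetical expensive simulation: a smooth response over one design
# variable; a design is "feasible" when the response exceeds 0.
def expensive_simulation(x):
    return np.sin(3 * x) + 0.3 * x

# Small set of "real" simulation runs (costly to obtain).
X_real = rng.uniform(0, 3, size=(15, 1))
y_real = expensive_simulation(X_real).ravel()

# Surrogate model fit on the real runs.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), random_state=1)
gp.fit(X_real, y_real)

# Large pseudo dataset: cheap surrogate predictions at many new designs.
X_pseudo = rng.uniform(0, 3, size=(500, 1))
y_pseudo = gp.predict(X_pseudo)

# Train the classification tree on real + pseudo data to recognize the
# feasible design subspace (response > 0).
X_train = np.vstack([X_real, X_pseudo])
labels = (np.concatenate([y_real, y_pseudo]) > 0).astype(int)
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, labels)

# Evaluate against the true feasibility of a dense grid of designs.
X_grid = np.linspace(0, 3, 300).reshape(-1, 1)
true_feasible = (expensive_simulation(X_grid).ravel() > 0).astype(int)
accuracy = (tree.predict(X_grid) == true_feasible).mean()
print(f"feasible-domain classification accuracy: {accuracy:.2f}")
```

Note how the surrogate's own prediction error propagates into the pseudo labels, which is exactly the trade-off the paper investigates: a poor surrogate would mislabel designs near the feasibility boundary.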


2021 ◽  
Vol 14 (6) ◽  
pp. 997-1005
Author(s):  
Sandeep Tata ◽  
Navneet Potti ◽  
James B. Wendt ◽  
Lauro Beltrão Costa ◽  
Marc Najork ◽  
...  

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely many different ways. A good solution to this problem is one that generalizes well not only to known templates, such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean and discuss three key data management challenges: (1) managing the quality of ground-truth data, (2) generating training data for the machine learning model using labeled documents, and (3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without them. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.
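One of the data management challenges above, generating training data from labeled documents, can be illustrated with a deliberately minimal sketch: regex patterns propose candidate spans for each schema field, and candidates are labeled by matching against the document's ground truth. The schema, patterns, and invoice text here are hypothetical and far simpler than Glean's actual machinery:

```python
import re

# Hypothetical target schema for an "invoice" document type: each field
# has a regex that proposes candidate spans in the raw document text.
SCHEMA = {
    "invoice_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "total_amount": re.compile(r"\$\d+(?:\.\d{2})?"),
}

def generate_training_examples(text, ground_truth):
    """Turn one labeled document into (field, candidate, label) examples.

    Candidates that match the ground-truth value become positive
    examples; all other candidates for that field become negatives.
    This mirrors, loosely, how a labeled document can seed training
    data for a learned extractor.
    """
    examples = []
    for field, pattern in SCHEMA.items():
        for match in pattern.finditer(text):
            label = int(match.group() == ground_truth.get(field))
            examples.append((field, match.group(), label))
    return examples

doc = ("Invoice issued 2021-03-14, due 2021-04-13. "
       "Subtotal $90.00, tax $9.00, total $99.00.")
truth = {"invoice_date": "2021-03-14", "total_amount": "$99.00"}

for example in generate_training_examples(doc, truth):
    print(example)
```

The hard cases Glean must handle, and this sketch does not, include ground-truth values that appear more than once, formatting mismatches between the label and the document text, and layout features beyond the raw character string.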


2021 ◽  
Author(s):  
Anna Goldenberg ◽  
Bret Nestor ◽  
Jaryd Hunter ◽  
Raghu Kainkaryam ◽  
Erik Drysdale ◽  
...  

Abstract Commercial wearable devices are surfacing as an appealing mechanism to detect COVID-19 and potentially other public health threats, due to their widespread use. To assess the validity of wearable devices as population health screening tools, it is essential to evaluate predictive methodologies based on wearable devices by mimicking their real-world deployment. Several points must be addressed to move from statistically significant differences between infected and uninfected cohorts to COVID-19 inferences about individuals. We demonstrate the strengths and shortcomings of existing approaches on a cohort of 32,198 individuals who experienced influenza-like illness (ILI), 204 of whom reported testing positive for COVID-19. We show that, despite commonly made design mistakes that result in overestimation of performance, when properly designed, wearables can be used effectively as part of the detection pipeline. For example, knowing the week of the year, combined with naive randomised test-set generation, leads to a substantial overestimate of COVID-19 classification performance, at 0.73 AUROC. However, only an average AUROC of 0.55 +/- 0.02 would be attainable in a simulation of real-world deployment, owing to the shifting prevalence of COVID-19 and of the non-COVID-19 ILI that triggers further testing. In this work we show how to train a machine learning model to differentiate ILI days from healthy days, followed by a survey that differentiates COVID-19 from influenza and unspecified ILI based on symptoms. For a forthcoming week, the models can expect a sensitivity of 0.50 (0-0.74, 95% CI), while utilising the wearable device to reduce the burden of surveys by 35%. The corresponding false positive rate is 0.22 (0.02-0.47, 95% CI). In the future, serious consideration must be given to the design, evaluation, and reporting of wearable device interventions if they are to be relied upon as part of frequent COVID-19 or other public health threat testing infrastructures.
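The evaluation-design point above, that naive randomised test-set generation interleaves train and test data in time while deployment is strictly prospective, can be sketched as follows. The records are synthetic and the feature name `rhr_delta` is a placeholder:

```python
import random

random.seed(0)

# Hypothetical daily records: (day_of_year, wearable features). COVID-19
# prevalence shifts over time, so calendar position (e.g. week of year)
# is highly informative in-sample but useless prospectively.
records = [(day, {"rhr_delta": random.random()}) for day in range(1, 366)]

# Naive random split: train and test days interleave in time, so a model
# can exploit calendar position -- overestimating deployed performance.
shuffled = random.sample(records, len(records))
naive_train, naive_test = shuffled[:292], shuffled[292:]
naive_leaks = max(d for d, _ in naive_train) > min(d for d, _ in naive_test)

# Deployment-faithful split: training data strictly precedes test data,
# mimicking prospective use ("predict the forthcoming week").
ordered = sorted(records, key=lambda r: r[0])
temporal_train, temporal_test = ordered[:292], ordered[292:]
no_leak = max(d for d, _ in temporal_train) < min(d for d, _ in temporal_test)

print(naive_leaks, no_leak)  # True True
```

The gap the authors report (0.73 AUROC under the naive split versus 0.55 under simulated deployment) is precisely the kind of overestimate the first split produces.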


Author(s):  
Prateek Sharma ◽  
M V Ramachandra Rao ◽  
Dr B N Mohapatra ◽  
A Saxena

The cement industry is one of the largest CO2 emitters and is continuously working to minimize these emissions. The use of artificial intelligence (AI) in manufacturing helps reduce breakdowns and failures by avoiding frequent start-ups and reducing fuel fluctuations, which ultimately reduces the carbon footprint. AI offers a new mode of digital manufacturing that can increase productivity by optimizing the use of assets at a fraction of the cost. The calciner is one of the key pieces of equipment in a cement plant; it dissociates calcium carbonate into calcium oxide and carbon dioxide, taking heat input from fuel combustion. An AI model of a calciner can provide valuable information that can be applied in real time to optimize calciner operation, resulting in fuel savings. In this study, key process parameters from three months of continuous operation were used to train various machine learning models, and the most suitable model was selected based on metrics such as RMSE and the R2 value. An artificial neural network was found to be the best-fitting model for the calciner. When validated against real-world data, this model predicts the calciner outlet temperature with a high degree of accuracy (+/- 2% error). Industries can use this model to estimate the calciner outlet temperature for different input parameters, as it is based not on the chemical and physical processes taking place in the calciner but on real-world historical data.
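Model selection by RMSE and R2, as described above, can be sketched with hypothetical validation numbers; the outlet temperatures and candidate predictions below are illustrative, not the study's data:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target (here, C)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical validation data: measured calciner outlet temperature (C)
# and predictions from two candidate models.
y_measured = np.array([870.0, 865.0, 880.0, 875.0, 860.0, 872.0])
candidates = {
    "linear_regression": np.array([862.0, 858.0, 871.0, 869.0, 852.0, 864.0]),
    "neural_network":    np.array([869.0, 866.0, 879.0, 874.0, 861.0, 871.0]),
}

scores = {name: (rmse(y_measured, pred), r2(y_measured, pred))
          for name, pred in candidates.items()}
best = min(scores, key=lambda name: scores[name][0])  # lowest RMSE wins
for name, (e, r) in scores.items():
    print(f"{name}: RMSE={e:.2f} C, R2={r:.3f}")
print("selected:", best)
```

In the study this comparison was run across several model families on three months of plant data; the selection logic, lowest RMSE and highest R2 on held-out data, is the same.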


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. 2051-2051
Author(s):  
Jeffrey J. Kirshner ◽  
Kelly Cohn ◽  
Steven Dunder ◽  
Karri Donahue ◽  
Madeline Richey ◽  
...  

2051 Background: Efforts to facilitate patient identification for clinical trials in routine practice, such as automating electronic health record (EHR) data reviews, are hindered by the lack of information on metastatic status in structured format. We developed a machine learning tool that infers metastatic status from unstructured EHR data, and we describe its real-world implementation. Methods: The machine learning model scans EHR documents, extracting features from text snippets surrounding key words (e.g., ‘Metastatic’, ‘Progression’, ‘Local’). A regularized logistic regression model was trained and used to classify patients across 5 metastatic-status inference categories: highly-likely and likely positive, highly-likely and likely negative, and unknown. Model accuracy was characterized using the Flatiron Health EHR-derived de-identified database of patients with solid tumors, in which manually abstracted information served as the reference standard. We assessed model accuracy using sensitivity and specificity (patients in the ‘unknown’ category omitted from the numerator) and negative and positive predictive values (NPV, PPV; ‘unknown’ patients included in the denominator), as well as its performance in a real-world dataset. In a separate validation, we evaluated the accuracy gained upon additional user review of the model outputs after integration of this tool into workflows. Results: The metastatic-status inference model was characterized using a sample of 66,532 patients. Model sensitivity and specificity (95% CI) were 82% (82, 83) and 95% (95, 96), respectively; PPV was 89% (89, 90) and NPV was 94% (94, 94). In the validation sample (N = 200, originating from 5 distinct care sites), after user review of model outputs, the values increased to 97% (85, 100) for sensitivity, 98% (95, 100) for specificity, 92% (78, 98) for PPV, and 99% (97, 100) for NPV. The model assigned 163/200 patients to the highly-likely categories, which were deemed not to require further EHR review by users. The prevalence of errors was 4% without user review and 2% after user review. Conclusions: This machine learning model infers metastatic status from unstructured EHR data with high accuracy. The tool assigns metastatic status with high confidence in more than 75% of cases without requiring additional manual review, allowing more efficient identification of clinical trial candidates and clinical trial matching, thus mitigating a key barrier to clinical trial participation in community clinics.
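The five-category inference scheme and the metric conventions above (omitting ‘unknown’ patients from sensitivity and specificity) can be sketched as follows; the probability thresholds and the toy cohort are illustrative assumptions, not values from the actual tool:

```python
def inference_category(p, hi=0.9, lo=0.6):
    """Map a model probability of metastatic disease to one of the five
    inference categories described in the abstract. The thresholds are
    illustrative, not those used by the actual tool."""
    if p >= hi:
        return "highly-likely positive"
    if p >= lo:
        return "likely positive"
    if p <= 1 - hi:
        return "highly-likely negative"
    if p <= 1 - lo:
        return "likely negative"
    return "unknown"

def sensitivity_specificity(pairs):
    """Sensitivity/specificity with 'unknown' patients omitted, matching
    the abstract's convention for these two metrics."""
    tp = fn = tn = fp = 0
    for prob, truly_metastatic in pairs:
        cat = inference_category(prob)
        if cat == "unknown":
            continue  # excluded from both metrics
        predicted_positive = "positive" in cat
        if truly_metastatic:
            tp += predicted_positive
            fn += not predicted_positive
        else:
            fp += predicted_positive
            tn += not predicted_positive
    return tp / (tp + fn), tn / (tn + fp)

# (predicted probability, abstracted ground truth) for a toy cohort.
cohort = [(0.95, True), (0.70, True), (0.50, True),   # 0.50 -> unknown
          (0.05, False), (0.35, False), (0.92, False)]
sens, spec = sensitivity_specificity(cohort)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

The two highly-likely buckets are the ones the workflow trusts without further EHR review; tightening `hi` trades review burden against the residual error rate.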


2019 ◽  
Vol 2019 (1) ◽  
pp. 5-25 ◽  
Author(s):  
Nisarg Raval ◽  
Ashwin Machanavajjhala ◽  
Jerry Pan

Abstract Personal data garnered from various sensors are often offloaded by applications to the cloud for analytics. This creates a potential risk of disclosing private user information. We observe that the analytics run on the cloud are often limited to a machine learning model, such as predicting a user’s activity using an activity classifier. We present Olympus, a privacy framework that limits the risk of disclosing private user information by obfuscating sensor data while minimally affecting the functionality the data are intended for. Olympus achieves privacy through a utility-aware obfuscation mechanism in which privacy and utility requirements are modeled as adversarial networks. Through rigorous and comprehensive evaluation on a real-world app and on benchmark datasets, we show that Olympus successfully limits the disclosure of private information without significantly affecting the functionality of the application.
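Olympus models the privacy and utility requirements as adversarial networks; as a far simpler linear stand-in (our assumption, not the paper's method), one can remove the feature-space direction that a least-squares "adversary" finds most predictive of the private attribute, then check that the private signal is suppressed while utility-relevant signal survives:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Toy sensor features: column 0 carries the utility signal (activity),
# column 1 mixes in the private signal, column 2 is pure noise.
activity = rng.standard_normal(n)
private = rng.standard_normal(n)
X = np.column_stack([
    activity,
    0.8 * private + 0.6 * rng.standard_normal(n),
    rng.standard_normal(n),
])

# A linear "adversary": the feature-space direction most predictive of
# the private attribute, found by least squares. (Olympus instead trains
# neural obfuscator/adversary networks against each other.)
w, *_ = np.linalg.lstsq(X, private, rcond=None)
w = w / np.linalg.norm(w)

# Obfuscate by removing each sample's component along that direction.
X_obf = X - np.outer(X @ w, w)

def abs_corr(a, b):
    return abs(float(np.corrcoef(a, b)[0, 1]))

priv_before = abs_corr(X[:, 1], private)     # strong private leakage
priv_after = abs_corr(X_obf[:, 1], private)  # largely suppressed
util_after = abs_corr(X_obf[:, 0], activity)  # utility mostly intact
print(f"privacy leakage: {priv_before:.2f} -> {priv_after:.2f}, "
      f"utility retained: {util_after:.2f}")
```

A linear projection only defeats linear adversaries; the adversarial-training formulation in the paper is what extends this idea to nonlinear models of both privacy and utility.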

