ChemProps: A RESTful API enabled database for composite polymer name standardization

2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Bingyin Hu ◽  
Anqi Lin ◽  
L. Catherine Brinson

Abstract
The inconsistency of polymer indexing caused by the lack of uniformity in the expression of polymer names is a major challenge for widespread use of polymer-related data resources, and it limits the broad application of materials informatics for innovation across polymer science and polymer-based materials. The current solution of using a variety of different chemical identifiers has proven insufficient to address the challenge and is not intuitive for researchers. This work proposes a multi-algorithm-based mapping methodology entitled ChemProps that is optimized to solve the polymer indexing issue with a design that is easy to update in both depth and width. A RESTful API is provided for lightweight data exchange and easy integration across data systems. A weight factor is assigned to each algorithm to generate scores for candidate chemical names, and the weights are optimized to maximize the minimum score difference between the ground-truth chemical name and the other candidate chemical names. Ten-fold validation is applied to the 160 training data points to prevent overfitting. The resulting set of weight factors achieves 100% accuracy on the 54 test data points. The weight factors will evolve as ChemProps grows. With ChemProps, other polymer databases can remove duplicate entries and enable a more accurate “search by SMILES” function by using ChemProps as a common name-to-SMILES translator through API calls. ChemProps is also an excellent tool for auto-populating polymer properties thanks to its easy-to-update design.
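The weighted multi-algorithm scoring described above can be illustrated with a minimal sketch. The two candidate-scoring algorithms and the weight values below are hypothetical stand-ins for illustration, not ChemProps' actual components:

```python
# Minimal sketch of ChemProps-style weighted candidate scoring.
def score_candidates(candidates, algorithms, weights):
    """Combine per-algorithm scores (each in [0, 1]) into one weighted score per name."""
    return {name: sum(w * algo(name) for w, algo in zip(weights, algorithms))
            for name in candidates}

query = "poly(methyl methacrylate)"
algorithms = [
    lambda name: 1.0 if name.lower() == query.lower() else 0.0,  # exact-match check
    lambda name: 1.0 / (1.0 + abs(len(name) - len(query))),      # crude length similarity
]
# In ChemProps, these weights are tuned to maximize the margin between the
# ground-truth name and all other candidates; here they are arbitrary.
weights = [0.7, 0.3]
scores = score_candidates(["poly(methyl methacrylate)", "polystyrene"], algorithms, weights)
best = max(scores, key=scores.get)
```

The highest-scoring candidate is returned as the standardized name; the real system draws its candidates and scoring algorithms from its curated polymer database.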

2019 ◽  
Author(s):  
Liwei Cao ◽  
Danilo Russo ◽  
Vassilios S. Vassiliadis ◽  
Alexei Lapkin

A mixed-integer nonlinear programming (MINLP) formulation for symbolic regression was proposed to identify physical models from noisy experimental data. The formulation was tested on numerical models and was found to be more efficient than the previous literature example with respect to the number of predictor variables and training data points. The globally optimal search was extended to identify physical models and to cope with noise in the predictor variables of the experimental data. The methodology was coupled with automated collection of experimental data and proved successful in identifying the correct physical models describing the relationship between shear stress and shear rate for both Newtonian and non-Newtonian fluids, as well as simple kinetic laws of reactions. Future work will focus on addressing the limitations of the present formulation by extending it to larger, more complex physical models.
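As a rough illustration of the symbolic-regression idea, a brute-force subset search can stand in for the MINLP solver: binary "include this basis term" decisions are the integer variables, and the coefficients are fit by least squares. The basis terms and data below are invented for this sketch:

```python
import itertools
import numpy as np

# Brute-force stand-in for the MINLP: enumerate which basis terms enter the
# model, fit coefficients by least squares, keep the lowest-error model.
# A real MINLP solver handles the selection jointly and to global optimality.
def best_subset_fit(X_basis, y, max_terms=2):
    best_err, best_subset, best_coef = np.inf, None, None
    for k in range(1, max_terms + 1):
        for subset in itertools.combinations(range(X_basis.shape[1]), k):
            A = X_basis[:, subset]
            coef = np.linalg.lstsq(A, y, rcond=None)[0]
            err = float(np.sum((A @ coef - y) ** 2))
            if err < best_err:
                best_err, best_subset, best_coef = err, subset, coef
    return best_err, best_subset, best_coef

# Recover a Newtonian fluid law (stress = mu * shear_rate) from noiseless data.
shear_rate = np.linspace(0.1, 5.0, 30)
stress = 3.0 * shear_rate
basis = np.column_stack([shear_rate, shear_rate**2, np.ones_like(shear_rate)])
err, subset, coef = best_subset_fit(basis, stress, max_terms=1)
```

With one allowed term, the search correctly selects the linear shear-rate column with coefficient 3, i.e. the Newtonian law.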


2020 ◽  
Vol 499 (4) ◽  
pp. 5641-5652
Author(s):  
Georgios Vernardos ◽  
Grigorios Tsagkatakis ◽  
Yannis Pantazis

ABSTRACT Gravitational lensing is a powerful tool for constraining substructure in the mass distribution of galaxies, be it from the presence of dark matter sub-haloes or due to physical mechanisms affecting the baryons throughout galaxy evolution. Such substructure is hard to model and is either ignored by traditional smooth-modelling approaches or treated as well-localized massive perturbers. In this work, we propose a deep learning approach to quantify the statistical properties of such perturbations directly from images, where only the extended lensed source features within a mask are considered, without the need for any lens modelling. Our training data consist of mock lensed images assuming perturbing Gaussian random fields permeating the smooth overall lens potential and, for the first time, use images of real galaxies as the lensed source. We employ a novel deep neural network that accepts arbitrary uncertainty intervals associated with the training labels as input, provides probability distributions as output, and adopts a composite loss function. The method succeeds not only in accurately estimating the actual parameter values, but also reduces the predicted confidence intervals by 10 per cent in an unsupervised manner, i.e. without access to the actual ground-truth values. Our results are invariant to the inherent degeneracy between mass perturbations in the lens and complex brightness profiles for the source. Hence, we can robustly quantify the smoothness of the mass density of thousands of lenses, including confidence intervals, and provide a consistent ranking for follow-up science.
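One way interval-valued labels of the kind described above can enter a loss function is sketched below. This is an illustrative assumption on my part, not the paper's actual composite loss: predictions inside the label interval incur no data penalty, while a second term discourages overly wide predicted intervals.

```python
import numpy as np

# Sketch of a composite loss for interval-valued labels [lo, hi]:
# squared distance to the nearest interval bound, plus a width penalty.
def interval_loss(pred_mean, pred_width, label_lo, label_hi, alpha=0.1):
    below = np.maximum(label_lo - pred_mean, 0.0)   # penalty if below the interval
    above = np.maximum(pred_mean - label_hi, 0.0)   # penalty if above the interval
    data_term = below**2 + above**2
    width_term = alpha * pred_width**2              # reward narrow predictions
    return np.mean(data_term + width_term)

# A prediction inside the label interval incurs zero loss; one outside does not.
inside_loss = interval_loss(np.array([0.5]), np.array([0.0]), np.array([0.0]), np.array([1.0]))
outside_loss = interval_loss(np.array([1.5]), np.array([0.0]), np.array([0.0]), np.array([1.0]))
```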


2021 ◽  
Vol 22 (Supplement_1) ◽  
Author(s):  
D Zhao ◽  
E Ferdian ◽  
GD Maso Talou ◽  
GM Quill ◽  
K Gilbert ◽  
...  

Abstract
Funding Acknowledgements: Type of funding sources: Public grant(s) – National budget only. Main funding source(s): National Heart Foundation (NHF) of New Zealand; Health Research Council (HRC) of New Zealand.

Artificial intelligence shows considerable promise for automated analysis and interpretation of medical images, particularly in the domain of cardiovascular imaging. While application to cardiac magnetic resonance (CMR) has demonstrated excellent results, automated analysis of 3D echocardiography (3D-echo) remains challenging due to the lower signal-to-noise ratio (SNR), signal dropout, and greater interobserver variability in manual annotations. As 3D-echo becomes increasingly widespread, robust analysis methods will substantially benefit patient evaluation.

We sought to leverage the high SNR of CMR to provide training data for a convolutional neural network (CNN) capable of analysing 3D-echo. We imaged 73 participants (53 healthy volunteers, 20 patients with non-ischaemic cardiac disease) under both CMR and 3D-echo (<1 hour between scans). 3D models of the left ventricle (LV) were independently constructed from CMR and 3D-echo and used to spatially align the image volumes by least-squares fitting to a cardiac template. The resultant transformation was used to map the CMR mesh to the 3D-echo image. Alignment of mesh and image was verified through volume slicing and visual inspection (Fig. 1) for 120 paired datasets (including 47 rescans), each at end-diastole and end-systole. 100 datasets (80 for training, 20 for validation) were used to train a shallow CNN for mesh extraction from 3D-echo, optimised with a composite loss function consisting of normalised Euclidean distance (over 290 mesh points) and volume. Data augmentation was applied in the form of rotations and tilts (<15 degrees) about the long axis. The network was tested on the remaining 20 datasets (different participants) of varying image quality (Tab. 1).

For comparison, corresponding LV measurements from conventional manual analysis of 3D-echo and the associated interobserver variability (two observers) were also estimated. Initial results indicate that using embedded CMR meshes as training data for 3D-echo analysis is a promising alternative to manual analysis, with improved accuracy and precision compared with conventional methods. Further optimisation and a larger dataset are expected to improve network performance.

Tab. 1. LV mass and volume differences (mean ± standard deviation) for the 20 test cases. Algorithm error: CNN – CMR (as ground truth).

(n = 20)              LV EDV (ml)     LV ESV (ml)     LV EF (%)     LV mass (g)
Ground truth (CMR)    150.5 ± 29.5    57.9 ± 12.7     61.5 ± 3.4    128.1 ± 29.8
Algorithm error       -13.3 ± 15.7    -1.4 ± 7.6      -2.8 ± 5.5    0.1 ± 20.9
Manual error          -30.1 ± 21.0    -15.1 ± 12.4    3.0 ± 5.0     Not available
Interobserver error   19.1 ± 14.3     14.4 ± 7.6      -6.4 ± 4.8    Not available

Fig. 1. CMR mesh registered to 3D-echo.
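The composite loss used for training (normalised Euclidean distance over the 290 mesh points plus a volume term) can be sketched as follows; the relative weighting `beta` and the normalisation `scale` are assumptions for illustration, not values from the abstract:

```python
import numpy as np

# Hedged sketch of the composite mesh-extraction loss: mean point-to-point
# Euclidean distance (normalised by `scale`) plus a relative volume term.
def composite_mesh_loss(pred_pts, true_pts, pred_vol, true_vol, scale=1.0, beta=0.5):
    point_term = np.mean(np.linalg.norm(pred_pts - true_pts, axis=1)) / scale
    vol_term = abs(pred_vol - true_vol) / max(true_vol, 1e-9)
    return point_term + beta * vol_term

# 290 mesh points, as in the abstract; a perfect prediction has zero loss.
mesh = np.random.default_rng(0).normal(size=(290, 3))
perfect = composite_mesh_loss(mesh, mesh, 150.0, 150.0)
```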


Author(s):  
STEFANO MERLER ◽  
BRUNO CAPRILE ◽  
CESARE FURLANELLO

In this paper, we propose a regularization technique for AdaBoost. The method implements a bias-variance control strategy to avoid overfitting in classification tasks on noisy data. It is based on a notion of easy and hard training patterns that emerges from analysis of the dynamical evolution of AdaBoost weights. The procedure consists of sorting the training data points by a hardness measure and progressively eliminating the hardest, stopping at an automatically selected threshold. The effectiveness of the method is tested and discussed on synthetic as well as real data.
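The weight-dynamics idea can be sketched as follows. This toy version uses random threshold stumps as weak learners and accumulates each sample's AdaBoost weight over rounds as its hardness score; both choices are illustrative assumptions, not the paper's actual procedure:

```python
import numpy as np

# Toy illustration: samples that AdaBoost repeatedly up-weights are "hard".
def hardness_scores(X, y, n_rounds=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)
    hardness = np.zeros(n)
    for _ in range(n_rounds):
        # Random threshold stump as a stand-in weak learner.
        feat = rng.integers(X.shape[1])
        thresh = rng.uniform(X[:, feat].min(), X[:, feat].max())
        pred = np.where(X[:, feat] > thresh, 1, -1)
        err = float(np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10))
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * pred)  # standard AdaBoost re-weighting
        w /= w.sum()
        hardness += w  # chronically up-weighted samples accumulate hardness
    return hardness

X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y = np.array([-1, -1, 1, 1, -1])  # last point carries a "noisy" label
h = hardness_scores(X, y)
keep = np.argsort(h)[:4]  # drop the single hardest example before refitting
```

The paper's method additionally selects the elimination threshold automatically rather than fixing the number of dropped points.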


2021 ◽  
Author(s):  
Faruk Alpak ◽  
Yixuan Wang ◽  
Guohua Gao ◽  
Vivek Jain

Abstract Recently, a novel distributed quasi-Newton (DQN) derivative-free optimization (DFO) method was developed for generic reservoir performance optimization problems including well-location optimization (WLO) and well-control optimization (WCO). DQN is designed to effectively locate multiple local optima of highly nonlinear optimization problems. However, its performance has neither been validated by realistic applications nor compared to other DFO methods. We have integrated DQN into a versatile field-development optimization platform designed specifically for iterative workflows enabled through distributed-parallel flow simulations. DQN is benchmarked against alternative DFO techniques, namely, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method hybridized with Direct Pattern Search (BFGS-DPS), Mesh Adaptive Direct Search (MADS), Particle Swarm Optimization (PSO), and Genetic Algorithm (GA). DQN is a multi-thread optimization method that distributes an ensemble of optimization tasks among multiple high-performance-computing nodes. Thus, it can locate multiple optima of the objective function in parallel within a single run. Simulation results computed from one DQN optimization thread are shared with others by updating a unified set of training data points composed of responses (implicit variables) of all successful simulation jobs. The sensitivity matrix at the current best solution of each optimization thread is approximated by a linear-interpolation technique using all or a subset of training-data points. The gradient of the objective function is analytically computed using the estimated sensitivities of implicit variables with respect to explicit variables. The Hessian matrix is then updated using the quasi-Newton method. A new search point for each thread is solved from a trust-region subproblem for the next iteration. In contrast, other DFO methods rely on a single-thread optimization paradigm that can only locate a single optimum. 
To locate multiple optima with such methods, one must repeat the same optimization process multiple times, starting from different initial guesses. Moreover, simulation results generated by a single-thread optimization task cannot be shared with other tasks. Benchmarking results are presented for synthetic yet challenging WLO and WCO problems. Finally, the DQN method is field-tested on two realistic applications. DQN identifies the global optimum with the fewest simulations and the shortest run time on a synthetic problem with a known solution. On the other benchmarking problems, without a known solution, DQN identified comparable local optima with considerably fewer simulations than the alternative techniques. Field-testing results reinforce the favourable computational attributes of DQN. Overall, the results indicate that DQN is a novel and effective parallel algorithm for field-scale development optimization problems.
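The quasi-Newton ingredient at the heart of DQN can be sketched with the standard BFGS update of the Hessian approximation; the sensitivity-based gradient estimation and the trust-region subproblem described above are omitted here:

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of a Hessian approximation H, given step s = x_new - x_old
    and gradient difference y = g_new - g_old."""
    sy = float(s @ y)
    if sy <= 1e-12:  # skip the update if the curvature condition fails
        return H
    Hs = H @ s
    return H - np.outer(Hs, Hs) / float(s @ Hs) + np.outer(y, y) / sy

# The updated approximation satisfies the secant condition H_new @ s == y.
H0 = np.eye(2)
s = np.array([1.0, 0.0])
y = np.array([2.0, 0.0])
H1 = bfgs_update(H0, s, y)
```

In DQN, each optimization thread maintains its own Hessian approximation of this kind while sharing simulation results through the common pool of training data points.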


2016 ◽  
Vol 113 (47) ◽  
pp. 13301-13306 ◽  
Author(s):  
Dezhen Xue ◽  
Prasanna V. Balachandran ◽  
Ruihao Yuan ◽  
Tao Hu ◽  
Xiaoning Qian ◽  
...  

An outstanding challenge in the nascent field of materials informatics is to incorporate materials knowledge in a robust Bayesian approach to guide the discovery of new materials. Utilizing inputs from known phase diagrams, features or material descriptors that are known to affect the ferroelectric response, and Landau–Devonshire theory, we demonstrate our approach for BaTiO3-based piezoelectrics with the desired target of a vertical morphotropic phase boundary. We predict, synthesize, and characterize a solid solution, (Ba0.5Ca0.5)TiO3-Ba(Ti0.7Zr0.3)O3, with piezoelectric properties that show better temperature reliability than other BaTiO3-based piezoelectrics in our initial training data.


Author(s):  
Saket Kunwar

On April 25, 2015, an earthquake of magnitude 7.8 on the Richter scale occurred with its epicentre at Barpak (28°12'20''N, 84°44'19''E), Nepal. Landslides induced by the earthquake and its aftershocks added to the natural disaster, which claimed more than 9,000 lives. Landslides, represented as lines extending from the head scarp to the toe of the deposit, were mapped by the staff of the British Geological Survey and are freely available under the Open Data Commons Open Database License (ODC-ODbL) at the Humanitarian Data Exchange Program. This collection of 5,578 landslides is used as preliminary ground truth in this study, with the aim of producing polygonal delineations of the landslides from the polylines via object-oriented segmentation. Texture measures from Sentinel-1a Ground Range Detected (GRD) amplitude data and an eigenvalue-decomposed Single Look Complex (SLC) polarimetry product are stacked for this purpose. This has also enabled the investigation of landslide properties in the H-Alpha plane while developing a classification mechanism for identifying the occurrence of landslides.


Author(s):  
Eugenia Rinaldi ◽  
Sylvia Thun

HiGHmed is a German consortium in which eight university hospitals have agreed to cross-institutional data exchange through novel medical informatics solutions. The HiGHmed Use Case Infection Control group has modelled a set of infection-related data in the openEHR format. In order to establish interoperability with the other German consortia belonging to the same national initiative, we mapped the openEHR information to the Fast Healthcare Interoperability Resources (FHIR) format recommended within the initiative. FHIR enables fast exchange of data thanks to the discrete and independent data elements into which information is organized. Furthermore, to explore the possibility of maximizing the analysis capabilities for our data set, we subsequently mapped the FHIR elements to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). The OMOP data model is designed to support research identifying and evaluating associations between interventions and the outcomes they cause. Mapping across standards allows their respective strengths to be exploited while establishing and/or maintaining interoperability. This article provides an overview of our experience in mapping infection-control-related data across three different standards: openEHR, FHIR, and OMOP CDM.
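A toy example of carrying a single data element across the three standards is sketched below. The archetype identifier, field names, and values are illustrative assumptions, not the consortium's actual mappings; real mappings involve terminology bindings and many more fields:

```python
# Hypothetical openEHR element for one microbiology result.
openehr_element = {
    "archetype": "openEHR-EHR-OBSERVATION.laboratory_test_result.v1",
    "analyte": "MRSA screening",
    "result": "positive",
}

def to_fhir_observation(el):
    # Map the openEHR element onto a minimal FHIR Observation resource.
    return {
        "resourceType": "Observation",
        "code": {"text": el["analyte"]},
        "valueString": el["result"],
    }

def to_omop_measurement(fhir_obs):
    # Map the FHIR Observation onto OMOP CDM MEASUREMENT source-value columns.
    return {
        "measurement_source_value": fhir_obs["code"]["text"],
        "value_source_value": fhir_obs["valueString"],
    }

fhir_obs = to_fhir_observation(openehr_element)
omop_row = to_omop_measurement(fhir_obs)
```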


Author(s):  
D. Gritzner ◽  
J. Ostermann

Abstract. Modern machine learning, especially deep learning, which is used in a variety of applications, requires a lot of labelled data for model training. Having an insufficient amount of training examples leads to models which do not generalize well to new input instances. This is a particularly significant problem for tasks involving aerial images: often, training data are only available for a limited geographical area and a narrow time window, leading to models which perform poorly in different regions, at different times of day, or during different seasons. Domain adaptation can mitigate this issue by using labelled source-domain training examples and unlabelled target-domain images to train a model which performs well on both domains. Modern adversarial domain adaptation approaches use unpaired data. We propose using pairs of semantically similar images, i.e., images whose segmentations are accurate predictions of each other, for improved model performance. In this paper we show that, as an upper limit based on ground truth, using semantically paired aerial images during training almost always increases model performance, with an average improvement of 4.2% accuracy and 0.036 mean intersection-over-union (mIoU). Using a practical estimate of semantic similarity, we still achieve improvements in more than half of all cases, with average improvements of 2.5% accuracy and 0.017 mIoU in those cases.
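The pairing criterion ("segmentations that are accurate predictions of each other") can be sketched with a mean intersection-over-union check; the similarity threshold below is an assumption for illustration, not a value from the paper:

```python
import numpy as np

# Mean IoU over the classes present in either segmentation map.
def mean_iou(seg_a, seg_b, n_classes):
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(seg_a == c, seg_b == c).sum()
        union = np.logical_or(seg_a == c, seg_b == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def semantically_paired(seg_a, seg_b, n_classes, threshold=0.7):
    """Two images count as semantically similar when their segmentations
    overlap strongly (illustrative threshold)."""
    return mean_iou(seg_a, seg_b, n_classes) >= threshold

# Identical segmentations are trivially paired.
seg = np.array([[0, 1], [1, 0]])
score = mean_iou(seg, seg, 2)
paired = semantically_paired(seg, seg, 2)
```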

