AutoSolvate: A Toolkit for Automating Quantum Chemistry Design and Discovery of Solvated Molecules

Author(s):  
Eugen Hruska ◽  
Ariel Gale ◽  
Xiao Huang ◽  
Fang Liu

The availability of large, high-quality data sets is crucial for artificial intelligence design and discovery in chemistry. Despite the essential roles of solvents in chemistry, the rapid computational data set generation of solution-phase molecular properties at the quantum mechanical level of theory was previously hampered by the complicated simulation procedure. Software toolkits that can automate the procedure to set up high-throughput explicit-solvent quantum chemistry (QC) calculations for arbitrary solutes and solvents in an open-source framework are still lacking. We developed AutoSolvate, an open-source toolkit to streamline the workflow for QC calculation of explicitly solvated molecules. It automates the solvated-structure generation, force field fitting, configuration sampling, and the final extraction of microsolvated cluster structures that QC packages can readily use to predict molecular properties of interest. AutoSolvate is available through both a command line interface and a graphical user interface, making it accessible to the broader scientific community. To improve the quality of the initial structures generated by AutoSolvate, we investigated the dependence of solute-solvent closeness on solute/solvent identities and trained a machine learning model to predict the closeness and guide initial structure generation. Finally, we tested the capability of AutoSolvate for rapid data set curation by calculating the outer-sphere reorganization energy of a large data set of 166 redox couples, which demonstrated the promise of the AutoSolvate package for chemical discovery efforts.
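The final workflow step above, extracting microsolvated cluster structures from sampled configurations, can be illustrated with a simple distance-shell sketch. This is not AutoSolvate's actual API: the function name, coordinate format, and 6 Å shell radius are all assumptions for illustration.

```python
import math

def extract_cluster(solute, solvent_mols, shell_radius=6.0):
    """Keep every solvent molecule with at least one atom within
    shell_radius (Angstroms) of any solute atom -- a minimal version of
    microsolvated-cluster extraction (hypothetical helper, not AutoSolvate's)."""
    kept = []
    for mol in solvent_mols:
        if any(math.dist(a, s) <= shell_radius for a in mol for s in solute):
            kept.append(mol)
    return kept

# Toy coordinates: one solute atom at the origin, two one-atom "solvent" molecules.
solute = [(0.0, 0.0, 0.0)]
solvents = [[(3.0, 0.0, 0.0)], [(12.0, 0.0, 0.0)]]
print(len(extract_cluster(solute, solvents)))  # 1
```

A real pipeline would read MD snapshots and write QC-ready input files; the geometric shell criterion shown here is only the core idea.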

2021 ◽  
Author(s):  
Tagir Akhmetshin ◽  
Arkadii Lin ◽  
Daniyar Mazitov ◽  
Evgenii Ziaikin ◽  
Timur Madzhidov ◽  
...  

Graph-based architectures are becoming increasingly popular as tools for structure generation. Here, we introduce HyFactor, a novel open-source architecture inspired by the previously reported DEFactor architecture and based on hydrogen-labeled graphs. Since the original DEFactor code was not available, a new implementation (ReFactor) was prepared in this work for benchmarking purposes. HyFactor demonstrates high performance on the ZINC 250K, MOSES, and ChEMBL data sets, and in molecular generation tasks it is considerably more effective than ReFactor. The code of HyFactor and all models obtained in this study are publicly available from our GitHub repository: https://github.com/Laboratoire-de-Chemoinformatique/hyfactor
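The hydrogen-labeled graph idea can be sketched with a toy data structure: each heavy atom carries its hydrogen count as a node label, so a generator never has to place explicit hydrogen nodes. The representation below is our reading of the concept, not HyFactor's internal format.

```python
# Hydrogen-labeled graph for ethanol (CH3-CH2-OH): heavy atoms only,
# with the attached-hydrogen count stored as a node label.
ethanol = {
    "nodes": [("C", 3), ("C", 2), ("O", 1)],  # (element, H count)
    "edges": [(0, 1), (1, 2)],                # bonds between node indices
}

# Hydrogens are recovered by summing the labels rather than by
# generating extra graph nodes.
total_h = sum(h for _, h in ethanol["nodes"])
print(total_h)  # 6
```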


Author(s):  
Bret Bosma ◽  
Asher Simmons ◽  
Pedro Lomonaco ◽  
Kelley Ruehl ◽  
Budi Gunawan

In the wave energy industry, there is a need for open-source numerical codes and publicly available experimental data, both of which are being addressed through the development of WEC-Sim by Sandia National Laboratories and the National Renewable Energy Laboratory (NREL). WEC-Sim is an open-source code used to model wave energy converters (WECs) subject to incident waves. In order for the WEC-Sim code to be useful, code verification and physical model validation are necessary. This paper describes the wave tank testing for the 1:33 scale experiments of a Floating Oscillating Surge Wave Energy Converter (FOSWEC). The WEC-Sim experimental data set will help to advance the wave energy converter industry by providing a free, high-quality data set for researchers and developers. This paper describes the WEC-Sim open-source wave energy converter simulation tool and the experimental validation plan, and presents preliminary experimental results from the FOSWEC Phase 1 testing.


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses some advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. The MCAR, MAR, and NMAR missing-data mechanisms are applied to the data set, and five missing-data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is based on the random forest approach. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was carried out for all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism had the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy: missForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and can therefore be considered appropriate for analyzing air quality data.
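The RMSE and MAE used above to score the imputations are straightforward to state; a minimal sketch with made-up numbers (not missForest's implementation):

```python
import math

def rmse(truth, pred):
    """Root-mean-square error between true and imputed values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

def mae(truth, pred):
    """Mean absolute error between true and imputed values."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

# Invented pollutant readings: held-out truth vs. an imputation.
truth = [10.0, 12.0, 9.0, 11.0]
imputed = [10.5, 11.0, 9.5, 11.0]
print(round(rmse(truth, imputed), 3), mae(truth, imputed))  # 0.612 0.5
```

In an imputation study like this one, values are deleted under a chosen mechanism (MCAR, MAR, or NMAR), re-estimated, and then scored against the held-out truth with exactly these two metrics.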


Author(s):  
Sebastian Hoppe Nesgaard Jensen ◽  
Mads Emil Brix Doest ◽  
Henrik Aanæs ◽  
Alessio Del Bue

Abstract. Non-rigid structure from motion (NRSfM) is a long-standing and central problem in computer vision, and its solution is necessary for obtaining 3D information from multiple images when the scene is dynamic. A main issue hindering further development of this important topic is the lack of high-quality data sets. We address this issue by presenting a data set created for this purpose, which is made publicly available and is considerably larger than the previous state of the art. To validate the applicability of this data set, and to provide an investigation into the state of the art of NRSfM, including potential directions forward, we present a benchmark and a scrupulous evaluation using this data set. The benchmark evaluates 18 different methods with available code that reasonably span the state of the art in sparse NRSfM. This new public data set and evaluation protocol will provide benchmark tools for further development in this challenging field.
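Sparse NRSfM methods build on the orthographic measurement model, in which the 2D tracks of frame f are W_f = R_f S_f for a 2×3 camera matrix R_f (the first two rows of a rotation) and a per-frame 3×P shape S_f. A minimal sketch with toy numbers:

```python
def project(R2x3, shape3xP):
    """Orthographic projection of a 3 x P shape by a 2 x 3 camera matrix:
    the observed 2D tracks for one frame of an NRSfM sequence."""
    P = len(shape3xP[0])
    return [
        [sum(R2x3[r][k] * shape3xP[k][p] for k in range(3)) for p in range(P)]
        for r in range(2)
    ]

# Identity-rotation frame: projection simply drops the z coordinate.
R = [[1, 0, 0], [0, 1, 0]]
S = [[0.0, 1.0], [0.0, 2.0], [5.0, -3.0]]  # 3 x 2: two 3D points
print(project(R, S))  # [[0.0, 1.0], [0.0, 2.0]]
```

NRSfM inverts this map: given only the stacked 2D tracks W over all frames, recover the rotations and the deforming shapes.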


2018 ◽  
Vol 141 (3) ◽  
Author(s):  
Artur Joao Carvalho Figueiredo ◽  
Robin Jones ◽  
Oliver J. Pountney ◽  
James A. Scobie ◽  
Gary D. Lock ◽  
...  

This paper presents volumetric velocimetry (VV) measurements for a jet in crossflow that is representative of film cooling. VV employs particle tracking to nonintrusively extract all three components of velocity in a three-dimensional volume. This is its first use in a film-cooling context. The primary research objective was to develop this novel measurement technique for turbomachinery applications, while collecting a high-quality data set that can improve the understanding of the flow structure of the cooling jet. A new facility was designed and manufactured for this study with emphasis on optical access and controlled boundary conditions. For a range of momentum flux ratios from 0.65 to 6.5, the measurements clearly show the penetration of the cooling jet into the freestream, the formation of kidney-shaped vortices, and entrainment of main flow into the jet. The results are compared to published studies using different experimental techniques, with good agreement. Further quantitative analysis of the location of the kidney vortices demonstrates their lift off from the wall and increasing lateral separation with increasing momentum flux ratio. The lateral divergence correlates very well with the self-induced velocity created by the wall–vortex interaction. Circulation measurements quantify the initial roll up and decay of the kidney vortices and show that the point of maximum circulation moves downstream with increasing momentum flux ratio. The potential for nonintrusive VV measurements in turbomachinery flow has been clearly demonstrated.
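The self-induced velocity invoked above is, in the classical potential-flow picture, the velocity a vortex acquires from its image in the wall: Γ/(4πh) for circulation Γ at height h. The sketch below shows that textbook relation only, not the paper's measured correlation, and the numbers are arbitrary.

```python
import math

def wall_induced_velocity(gamma, h):
    """Lateral velocity induced on a line vortex of circulation gamma at
    height h above a wall by its image vortex (separation 2h), i.e.
    gamma / (2 * pi * 2h). Pure potential-flow idealization."""
    return gamma / (4 * math.pi * h)

# Arbitrary values: gamma = 1.0 m^2/s, h = 0.05 m.
print(round(wall_induced_velocity(1.0, 0.05), 3))  # 1.592
```

Each vortex of a kidney pair sits near the wall, so this mechanism pushes the two vortices apart laterally, consistent with the divergence trend described in the abstract.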


2019 ◽  
Vol 11 (14) ◽  
pp. 1682 ◽  
Author(s):  
Torsten Geldsetzer ◽  
Shahid K. Khurshid ◽  
Kerri Warner ◽  
Filipe Botelho ◽  
Dean Flett

RADARSAT Constellation Mission (RCM) compact polarimetry (CP) data were simulated using 504 RADARSAT-2 quad-pol SAR images. These images were used to sample CP data in three RCM modes, building a data set with co-located ocean wind vector observations from in situ buoys on the West and East coasts of Canada. Wind speeds up to 18 m/s were included. CP and linear polarization parameters were related to the C-band model (CMOD) geophysical model functions CMOD-IFR2 and CMOD5n. These were evaluated for their wind retrieval potential in each RCM mode. The CP parameter Conformity was investigated to establish a data-quality threshold (>0.2) that ensures high-quality data for model validation. An accuracy analysis shows that the first Stokes vector (SV0) and the right-transmit vertical-receive backscatter (RV) parameters were as good as the VV backscatter with CMOD inversion. SV0 produced wind speed retrieval accuracies between 2.13 m/s and 2.22 m/s, depending on the RCM mode. The RCM Medium Resolution 50 m mode produced the best results; the Low Resolution 100 m and Low Noise modes provided similar results. The efficacy of SV0 and RV imparts confidence in the continuity of robust wind speed retrieval with RCM CP data. Three image-based case studies illustrate the potential for the application of CP parameters and RCM modes in operational wind retrieval systems. The results of this study provide guidance to direct research objectives once RCM is launched. The results also provide guidance for operational RCM data implementation in Canada's National SAR winds system, which provides near-real-time wind speed estimates to operational marine forecasters and meteorologists within Environment and Climate Change Canada.
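The first Stokes parameter is conventionally the total power received in the two channels of a compact-pol measurement. Assuming that standard definition (the channel values below are invented), SV0 is a one-liner:

```python
def sv0(e_rh, e_rv):
    """First Stokes parameter of a compact-pol measurement: total power in
    the H-receive and V-receive channels for circular transmit (standard
    definition; not code from the study)."""
    return abs(e_rh) ** 2 + abs(e_rv) ** 2

# Made-up complex channel voltages.
print(round(sv0(complex(0.3, 0.4), complex(0.0, 0.5)), 6))  # 0.5
```

Because SV0 aggregates both receive channels, it is comparatively robust to noise, one plausible reason it matches VV-based CMOD inversion in the accuracy analysis above.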


2014 ◽  
Vol 14 (13) ◽  
pp. 19747-19789
Author(s):  
F. Tan ◽  
H. S. Lim ◽  
K. Abdullah ◽  
T. L. Yoon ◽  
B. Holben

Abstract. In this study, the optical properties of aerosols in Penang, Malaysia were analyzed for four monsoonal seasons (northeast monsoon, pre-monsoon, southwest monsoon, and post-monsoon) based on data from the AErosol RObotic NETwork (AERONET) from February 2012 to November 2013. The aerosol distribution patterns in Penang for each monsoonal period were quantitatively identified from scatter plots of the aerosol optical depth (AOD) against the Angstrom exponent. A modified algorithm based on the prototype model of Tan et al. (2014a) was proposed to predict the AOD data. Ground-based measurements (i.e., visibility and air pollutant index) were used in the model as predictors to retrieve the AOD data missing from AERONET because of frequent cloud formation in the equatorial region. The model coefficients were determined through multiple regression analysis using a selected set of in situ data. The predicted AOD of the model was generated from these coefficients and compared against the measured data through standard statistical tests. The predicted AOD in the proposed model yielded a coefficient of determination R2 of 0.68, and the corresponding percent mean relative error was less than 0.33% compared with the measured data. The results revealed that the proposed model efficiently predicted the AOD data. Validation tests were performed on the model against selected LIDAR data and yielded good correspondence. The predicted AOD can be used to monitor short- and long-term AOD and to provide supplementary information for atmospheric corrections.
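The coefficient-fitting step described above is a multiple regression; a self-contained ordinary-least-squares sketch follows. All numbers are invented for illustration and are unrelated to the paper's visibility and air-pollutant-index data.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit(x1, x2, y):
    """OLS for y ~ b0 + b1*x1 + b2*x2 via the normal equations."""
    X = [[1.0, a, b] for a, b in zip(x1, x2)]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
    return solve(XtX, Xty)

# Toy data generated from an exact linear rule (unrelated to the paper).
vis = [10.0, 8.0, 6.0, 4.0]       # "visibility"
api = [20.0, 50.0, 60.0, 90.0]    # "air pollutant index"
aod = [0.05, 0.22, 0.29, 0.46]
b0, b1, b2 = fit(vis, api, aod)
pred = b0 + b1 * 6.0 + b2 * 60.0
print(round(pred, 3))  # 0.29
```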


2019 ◽  
Author(s):  
Fiona Pye ◽  
Nussaȉbah B Raja ◽  
Bryan Shirley ◽  
Ádám T Kocsis ◽  
Niklas Hohmann ◽  
...  

In a world where an increasing number of resources are hidden behind paywalls and monthly subscriptions, it is becoming crucial for the scientific community to invest energy into freely available, community-maintained systems. Open-source software projects offer a solution, with freely available code which users can utilise and modify under an open-source licence. In addition to software accessibility and methodological repeatability, this also enables and encourages the development of new tools. As palaeontology moves towards data-driven methodologies, it is becoming more important to acquire and provide high-quality data through reproducible, systematic procedures. Within the field of morphometrics, it is vital to adopt digital methods that help mitigate human bias in data collection. In addition, mathematically founded approaches can reduce the subjective decisions that plague classical data. This can be further developed through automation, which increases the efficiency of data collection and analysis. With these concepts in mind, we introduce two open-source shape analysis software packages that arose from projects within the medical imaging field. These are ImageJ, an image processing program with batch-processing features, and 3DSlicer, which focuses on 3D informatics and visualisation. They are easily extensible using common programming languages: 3DSlicer contains an internal Python interactor, and ImageJ allows the incorporation of several programming languages within its interface alongside its own simplified macro language. Additional features created by other users are readily available on GitHub or through the software itself. In the examples presented, an ImageJ plugin, "FossilJ", has been developed which provides semi-automated morphometric bivalve data collection. 3DSlicer is used with the extension SPHARM-PDM, applied to synchrotron scans of coniform conodonts for comparative morphometrics, for which small assistant tools have been created.


2020 ◽  
Author(s):  
Daniel Smith ◽  
Lori Burns ◽  
Andrew Simmonett ◽  
Robert Parrish ◽  
Matthew Schieber ◽  
...  

Psi4 is a free and open-source ab initio electronic structure program providing Hartree–Fock, density functional theory, many-body perturbation theory, configuration interaction, density cumulant theory, symmetry-adapted perturbation theory, and coupled-cluster theory. Most of the methods are quite efficient thanks to density fitting and multi-core parallelism. The program is a hybrid of C++ and Python, and calculations may be run with very simple text files or using the Python API, facilitating post-processing and complex workflows; method developers also have access to most of Psi4's core functionality via Python. Job specification may be passed using The Molecular Sciences Software Institute (MolSSI) QCSchema data format, facilitating interoperability. A rewrite of our top-level computation driver, and concomitant adoption of the MolSSI QCArchive Infrastructure project, make the latest version of Psi4 well suited to distributed computation of large numbers of independent tasks. The project has fostered the development of independent software components that may be reused in other quantum chemistry programs.
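The "very simple text files" mentioned above look roughly like the input sketch below (an SCF energy for water in Z-matrix form). Treat this as an approximate example from memory; consult the Psi4 manual for the current input syntax.

```
# Water SCF energy -- approximate Psi4 text input
molecule {
0 1
O
H 1 0.96
H 1 0.96 2 104.5
}

set basis cc-pVDZ
energy('scf')
```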


2014 ◽  
Author(s):  
David S Smith ◽  
Xia Li ◽  
Lori R Arlinghaus ◽  
Thomas E Yankeelov ◽  
E. Brian Welch

We present a fast, validated, open-source toolkit for processing dynamic contrast enhanced magnetic resonance imaging (DCE-MRI) data. We validate it against the Quantitative Imaging Biomarkers Alliance (QIBA) Standard and Extended Tofts-Kety phantoms and find near perfect recovery in the absence of noise, with an estimated 10-20x speedup in run time compared to existing tools. To explain the observed trends in the fitting errors, we present an argument about the conditioning of the Jacobian in the limit of small and large parameter values. We also demonstrate its use on an in vivo data set to measure performance on a realistic application. For a 192 x 192 breast image, we achieved run times of < 1 s. Finally, we analyze run time scaling with problem size and find that the run time per voxel scales as O(N^1.9), where N is the number of time points in the tissue concentration curve. DCEMRI.jl was much faster than any other analysis package tested and produced comparable accuracy, even in the presence of noise.
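The model behind the QIBA phantoms is the standard Tofts-Kety form, Ct(t) = Ktrans * integral from 0 to t of Cp(tau) * exp(-(Ktrans/ve)(t - tau)) d tau. A naive pure-Python discretisation (not DCEMRI.jl's fitter; the constant plasma input is a toy choice):

```python
import math

def tofts_ct(t, Cp, ktrans, ve):
    """Standard Tofts-Kety tissue curve from a plasma input Cp(t),
    using trapezoidal integration of the convolution at each time point."""
    kep = ktrans / ve
    Ct = []
    for i, ti in enumerate(t):
        g = [Cp[j] * math.exp(-kep * (ti - t[j])) for j in range(i + 1)]
        area = sum((g[j] + g[j + 1]) * (t[j + 1] - t[j]) / 2 for j in range(i))
        Ct.append(ktrans * area)
    return Ct

t = [0.1 * i for i in range(301)]   # 0..30 min, toy grid
Cp = [1.0] * len(t)                 # constant plasma input (toy)
Ct = tofts_ct(t, Cp, ktrans=0.25, ve=0.5)
# For constant Cp the curve saturates at ve * Cp = 0.5.
print(round(Ct[-1], 2))  # 0.5
```

Note the naive convolution is O(N^2) per curve, consistent with the near-quadratic run-time scaling reported above.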

