Steep channel freezeup processes: understanding complexity with statistical and physical models

The Wilkie, Stonham, and Aleksander recognition device (WiSARD) $n$-tuple classifier is a multiclass weightless neural network capable of learning a given pattern in a single step. Its architecture is determined by the number of classes it should discriminate. A target class is represented by a structure called a discriminator, which is composed of [Formula: see text] RAM nodes, each of them addressed by an $n$-tuple. Previous studies were carried out in order to mitigate an important problem of the WiSARD $n$-tuple classifier: having its RAM nodes saturated when trained by a large data set. Finding the VC dimension of the WiSARD $n$-tuple classifier was one of those studies. Although no exact value was found, tight bounds were discovered. Later, the bleaching technique was proposed as a means to avoid saturation. Recent empirical results with the bleaching extension showed that the WiSARD $n$-tuple classifier can achieve high accuracies with low variance in a great range of tasks. Theoretical studies had not been conducted with that extension previously. This work presents the exact VC dimension of the basic two-class WiSARD $n$-tuple classifier, which is linearly proportional to the number of RAM nodes belonging to a discriminator, and exponentially to their addressing tuple length, precisely [Formula: see text]. The exact VC dimension of the bleaching extension to the WiSARD $n$-tuple classifier, whose value is the same as that of the basic model, is also produced. Such a result confirms that the bleaching technique is indeed an enhancement to the basic WiSARD $n$-tuple classifier as it does no harm to the generalization capability of the original paradigm.
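The mechanism described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names, the shared pseudo-random input mapping, the use of access counters in each RAM node, and the threshold-raising loop used for bleaching are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class Discriminator:
    """One discriminator: a set of RAM nodes, each addressed by a tuple of input bits."""

    def __init__(self, input_size, tuple_size, mapping):
        # Split the (shuffled) input positions into consecutive groups of tuple_size.
        self.tuples = [mapping[i:i + tuple_size]
                       for i in range(0, input_size, tuple_size)]
        # Each RAM keeps an access count per address (counts are what bleaching needs).
        self.rams = [defaultdict(int) for _ in self.tuples]

    def _addresses(self, bits):
        for positions in self.tuples:
            yield tuple(bits[p] for p in positions)

    def train(self, bits):
        # Single-step learning: one write per RAM node.
        for ram, address in zip(self.rams, self._addresses(bits)):
            ram[address] += 1

    def response(self, bits, threshold=1):
        # Number of RAM nodes whose addressed counter reaches the threshold.
        return sum(1 for ram, address in zip(self.rams, self._addresses(bits))
                   if ram.get(address, 0) >= threshold)


class WiSARD:
    """Hypothetical wrapper holding one discriminator per class."""

    def __init__(self, input_size, tuple_size, classes, seed=0):
        mapping = list(range(input_size))
        random.Random(seed).shuffle(mapping)
        self.discriminators = {c: Discriminator(input_size, tuple_size, mapping)
                               for c in classes}

    def train(self, bits, label):
        self.discriminators[label].train(bits)

    def classify(self, bits):
        # Bleaching (as sketched here): raise the threshold until the tie is broken
        # or no RAM node responds any more.
        threshold = 1
        while True:
            scores = {c: d.response(bits, threshold)
                      for c, d in self.discriminators.items()}
            best = max(scores.values())
            winners = [c for c, s in scores.items() if s == best]
            if len(winners) == 1 or best == 0:
                return winners[0]
            threshold += 1


# Toy usage: both classes have seen the same pattern, but class "A" saw it more often.
pattern = [1, 1, 1, 1, 0, 0, 0, 0]
model = WiSARD(input_size=8, tuple_size=2, classes=["A", "B"])
for _ in range(3):
    model.train(pattern, "A")
model.train(pattern, "B")
predicted = model.classify(pattern)
```

With plain one-bit RAM nodes the two discriminators above would tie, which is the saturation problem the abstract mentions; counting accesses and raising the threshold is one way to break such ties.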


Molecular Transformer-aided Biocatalysed Synthesis Planning

10.26434/chemrxiv.14639007 ◽

2021 ◽

Author(s):

Daniel Probst ◽

Matteo Manica ◽

Yves Gaëtan Nana Teukam ◽

Alessandro Castrogiovanni ◽

Federico Paratore ◽

...

Keyword(s):

Green Chemistry ◽

Enzymatic Catalysis ◽

Large Data ◽

Single Step ◽

Specific Knowledge ◽

Data Set ◽

Domain Specific ◽

Forward Prediction ◽

Pathway Prediction ◽

Domain Specific Knowledge

Enzyme catalysts are an integral part of green chemistry strategies towards a more sustainable and resource-efficient chemical synthesis. However, the use of enzymes on unreported substrates and their specific stereo- and regioselectivity are domain-specific knowledge factors that require decades of field experience to master. This makes the retrosynthesis of given targets with biocatalysed reactions a significant challenge. Here, we use the molecular transformer architecture to capture the latent knowledge about enzymatic activity from a large data set of publicly available biochemical reactions, extending forward reaction and retrosynthetic pathway prediction to the domain of biocatalysis. We introduce the use of a class token based on the EC classification scheme that allows the model to capture catalysis patterns among different enzymes belonging to the same hierarchical families. The forward prediction model achieves top-1 and top-5 accuracies of 49.6% and 62.7%, respectively, while the single-step retrosynthetic model shows top-1 and top-10 round-trip accuracies of 39.6% and 42.6%, respectively. Trained models and curated data are made publicly available with the hope of promoting enzymatic catalysis and making green chemistry more accessible through the use of digital technologies.
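The top-1 and top-5 figures quoted above are top-k accuracies. As a rough illustration of how such a metric is usually computed (a sketch only, assuming exact string match between a predicted product and the recorded one; the function name and the toy SMILES strings are hypothetical and not taken from the paper):

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of examples whose target appears among the first k ranked predictions."""
    if len(ranked_predictions) != len(targets) or not targets:
        raise ValueError("need one non-empty ranked list per target")
    hits = sum(1 for candidates, target in zip(ranked_predictions, targets)
               if target in candidates[:k])
    return hits / len(targets)


# Toy data: each inner list is a model's ranked guesses for one reaction.
ranked = [
    ["CCO", "CC=O", "CC(=O)O"],
    ["CC=O", "CCO", "C"],
    ["C", "CC", "CCC"],
]
targets = ["CCO", "CCO", "CCCC"]

top1 = top_k_accuracy(ranked, targets, k=1)
top2 = top_k_accuracy(ranked, targets, k=2)
```

Round-trip accuracy additionally requires a forward model to check whether predicted precursors lead back to the target, so it is not covered by this sketch.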


Understanding the Variability in Graph Data Sets through Statistical Modeling on the Stiefel Manifold

Entropy ◽

10.3390/e23040490 ◽

2021 ◽

Vol 23 (4) ◽

pp. 490

Author(s):

Clément Mantoux ◽

Baptiste Couvy-Duchesne ◽

Federica Cacciamani ◽

Stéphane Epelbaum ◽

Stanley Durrleman ◽

...

Keyword(s):

Degrees Of Freedom ◽

Brain Connectivity ◽

Random Perturbation ◽

Synthetic Data ◽

Large Data ◽

Stiefel Manifold ◽

Data Set ◽

Model Complex ◽

Rank One ◽

The Uk

Network analysis provides a rich framework to model complex phenomena, such as human brain connectivity. It has proven effective for understanding their natural properties and for designing predictive models. In this paper, we study the variability within groups of networks, i.e., the structure of connection similarities and differences across a set of networks. We propose a statistical framework to model these variations based on manifold-valued latent factors. Each network adjacency matrix is decomposed as a weighted sum of rank-one matrix patterns. Each pattern is described as a random perturbation of a dictionary element. As a hierarchical statistical model, it enables the analysis of heterogeneous populations of adjacency matrices using mixtures. Our framework can also be used to infer the weight of missing edges. We estimate the parameters of the model using an Expectation-Maximization-based algorithm. Experimenting on synthetic data, we show that the algorithm is able to accurately estimate the latent structure in both low and high dimensions. We apply our model to a large data set of functional brain connectivity matrices from the UK Biobank. Our results suggest that the proposed model accurately describes the complex variability in the data set with a small number of degrees of freedom.
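To make the "weighted sum of rank-one patterns" concrete, here is a small Python sketch. It assumes each pattern is the outer product of a vector with itself and that the vectors are orthonormal, which is what a point on the Stiefel manifold would provide; the function name and the toy numbers are illustrative only, and the paper's noise model and estimation procedure are not reproduced.

```python
import math


def weighted_rank_one_sum(weights, patterns):
    """Return sum_i weights[i] * x_i x_i^T for vectors x_i given as lists."""
    n = len(patterns[0])
    total = [[0.0] * n for _ in range(n)]
    for weight, x in zip(weights, patterns):
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * x[i] * x[j]
    return total


# Two orthonormal pattern vectors for a 3-node network (toy values).
s = 1 / math.sqrt(2)
patterns = [[s, s, 0.0], [0.0, 0.0, 1.0]]
weights = [2.0, 0.5]

adjacency = weighted_rank_one_sum(weights, patterns)
```

Here the first pattern connects nodes 0 and 1 with weight close to 1, and the second contributes only to the diagonal entry of node 2, so two weights and two vectors describe the whole symmetric matrix.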


Cryo-TEM of amphiphilic polymer and amphiphile/polymer solutions

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100150216 ◽

1993 ◽

Vol 51 ◽

pp. 876-877

Author(s):

Yeshayahu Talmon

Keyword(s):

Microstructural Characterization ◽

Building Blocks ◽

Quantitative Information ◽

Physical Models ◽

Specimen Preparation ◽

Time Resolved ◽

Temperature Changes ◽

Temperature And Humidity ◽

Microstructured Fluids ◽

Air Streams

To achieve complete microstructural characterization of self-aggregating systems, one needs direct images in addition to quantitative information from non-imaging techniques, e.g., scattering or rheological measurements. Cryo-TEM enables us to image fluid microstructures at better than one nanometer resolution, with minimal specimen preparation artifacts. Direct images are used to determine the “building blocks” of the fluid microstructure; these are used to build reliable physical models with which quantitative information from techniques such as small-angle x-ray or neutron scattering can be analyzed.

To prepare vitrified specimens of microstructured fluids, we have developed the Controlled Environment Vitrification System (CEVS), which enables us to prepare samples under controlled temperature and humidity conditions, thus minimizing microstructural rearrangement due to volatile evaporation or temperature changes. The CEVS may be used to trigger on-the-grid processes to induce formation of new phases, or to study intermediate, transient structures during change of phase (“time-resolved cryo-TEM”). Recently we have developed a new CEVS, where temperature and humidity are controlled by continuous flow of a mixture of humidified and dry air streams.


Some statistical and CI models to predict chaotic high-frequency financial data

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189107 ◽

2020 ◽

Vol 39 (5) ◽

pp. 6419-6430

Author(s):

Dusan Marcek

Keyword(s):

Time Series Data ◽

Moving Average ◽

Methodological Approach ◽

Back Propagation ◽

Large Data ◽

Series Data ◽

Data Set ◽

Training Time ◽

Optimal Population ◽

Forecast Time

To forecast time series data, two methodological frameworks are considered: statistical modelling and computational intelligence modelling. The statistical approach is based on the theory of invertible ARIMA (Auto-Regressive Integrated Moving Average) models with Maximum Likelihood (ML) estimation. As a competing tool to statistical forecasting models, we use the popular classic neural network (NN) of perceptron type. To train the NN, the Back-Propagation (BP) algorithm and heuristics such as genetic and micro-genetic algorithms (GA and MGA) are implemented on the large data set. A comparative analysis of the selected learning methods is performed and evaluated. From the experiments performed, we find that the optimal population size will likely be 20, which gives the lowest training time among all NNs trained by the evolutionary algorithms, while the prediction accuracy is lower but still acceptable to managers.


In silico Prediction of Inhibitory Constant of Thrombin Inhibitors Using Machine Learning

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666181220130232 ◽

2019 ◽

Vol 21 (9) ◽

pp. 662-669 ◽

Cited By ~ 1

Author(s):

Junnan Zhao ◽

Lu Zhu ◽

Weineng Zhou ◽

Lingfeng Yin ◽

Yuchen Wang ◽

...

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Regression Tree ◽

Large Data ◽

Thrombin Inhibitors ◽

Coagulation Cascade ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Descriptor Selection

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Because it can find non-intuitive regularities in high-dimensional data sets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best one among these methods, with $R^2=0.84$, MSE=0.55 for the training set and $R^2=0.83$, MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
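For readers unfamiliar with the two reported statistics, the following Python sketch shows one common way to compute them. It assumes the coefficient-of-determination definition of $R^2$ (one minus the ratio of residual to total sum of squares); some QSAR studies instead report the squared correlation coefficient, and the abstract does not say which was used. The function names and sample values are illustrative, not data from the study.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy observed and predicted values (hypothetical).
observed = [1.0, 2.0, 3.0, 4.0]
predicted_values = [1.1, 1.9, 3.2, 3.8]

mse = mean_squared_error(observed, predicted_values)
r2 = r_squared(observed, predicted_values)
```

Comparing these values between the training and test sets, as the abstract does, is a quick check for overfitting: a large drop in $R^2$ on the test set would suggest the model does not generalize.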


Correlation between the structure and skin permeability of compounds

Scientific Reports ◽

10.1038/s41598-021-89587-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Ruolan Zeng ◽

Jiyong Deng ◽

Limin Dang ◽

Xinliang Yu

Keyword(s):

Large Data ◽

Qsar Model ◽

Coefficient Of Determination ◽

Support Vector ◽

Skin Permeability ◽

Data Set ◽

Test Set ◽

Svm Algorithm ◽

Svm Model ◽

Toxicity Relationship

A three-descriptor quantitative structure–activity/toxicity relationship (QSAR/QSTR) model was developed for the skin permeability of a sufficiently large data set consisting of 274 compounds, by applying a support vector machine (SVM) together with a genetic algorithm. The optimal SVM model has a coefficient of determination $R^2$ of 0.946 and a root mean square (rms) error of 0.253 for the training set of 139 compounds, and an $R^2$ of 0.872 and an rms error of 0.302 for the test set of 135 compounds. Compared with other models reported in the literature, our SVM model shows better statistical performance while handling more samples in the test set. Therefore, applying an SVM algorithm to develop a nonlinear QSAR model for skin permeability was successful.


Galaxy spin direction distribution in HST and SDSS show similar large-scale asymmetry

Publications of the Astronomical Society of Australia ◽

10.1017/pasa.2020.46 ◽

2020 ◽

Vol 37 ◽

Author(s):

Lior Shamir

Keyword(s):

Large Scale ◽

Spiral Galaxies ◽

Hubble Space Telescope ◽

Gravitational Interaction ◽

Large Data ◽

Sloan Digital Sky Survey ◽

Data Sets ◽

Dipole Axis ◽

Data Set ◽

The Asymmetry

Several recent observations using large data sets of galaxies showed a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets show a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence shows a dipole axis with probabilities of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^\circ,\delta=47^\circ)$ and is well within the $1\sigma$ error range compared to the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^\circ,\delta=61^\circ)$.


The Complete DNA Sequence of the Mitochondrial Genome of a “Living Fossil,” the Coelacanth (Latimeria chalumnae)

Genetics ◽

10.1093/genetics/146.3.995 ◽

1997 ◽

Vol 146 (3) ◽

pp. 995-1010 ◽

Cited By ~ 1

Author(s):

Rafael Zardoya ◽

Axel Meyer

Keyword(s):

Mitochondrial Genome ◽

Tandem Repeats ◽

Phylogenetic Analyses ◽

Large Data ◽

Molecular Data ◽

Phylogenetic Position ◽

Data Set ◽

Living Fossil ◽

Latimeria Chalumnae ◽

Relationship Of

The complete nucleotide sequence of the 16,407-bp mitochondrial genome of the coelacanth (Latimeria chalumnae) was determined. The coelacanth mitochondrial genome order is identical to the consensus vertebrate gene order which is also found in all ray-finned fishes, the lungfish, and most tetrapods. Base composition and codon usage also conform to typical vertebrate patterns. The entire mitochondrial genome was PCR-amplified with 24 sets of primers that are expected to amplify homologous regions in other related vertebrate species. Analyses of the control region of the coelacanth mitochondrial genome revealed the existence of four 22-bp tandem repeats close to its 3′ end. The phylogenetic analyses of a large data set combining genes coding for rRNAs, tRNA, and proteins (16,140 characters) confirmed the phylogenetic position of the coelacanth as a lobe-finned fish; it is more closely related to tetrapods than to ray-finned fishes. However, different phylogenetic methods applied to this largest available molecular data set were unable to resolve unambiguously the relationship of the coelacanth to the two other groups of extant lobe-finned fishes, the lungfishes and the tetrapods. Maximum parsimony favored a lungfish/coelacanth or a lungfish/tetrapod sistergroup relationship depending on which transversion:transition weighting is assumed. Neighbor-joining and maximum likelihood supported a lungfish/tetrapod sistergroup relationship.
