Generating Fixed-Size Training Sets for Large and Streaming Datasets

Author(s):  
Stefanos Ougiaroglou ◽  
Georgios Arampatzis ◽  
Dimitris A. Dervos ◽  
Georgios Evangelidis
Keyword(s):  
2019 ◽  
Author(s):  
Qi Yuan ◽  
Alejandro Santana-Bonilla ◽  
Martijn Zwijnenburg ◽  
Kim Jelfs

The chemical space for novel electronic donor-acceptor oligomers with targeted properties was explored using deep generative models and transfer learning. A General Recurrent Neural Network model was trained on the ChEMBL database to generate chemically valid SMILES strings. The parameters of the General Recurrent Neural Network were fine-tuned via transfer learning using the electronic donor-acceptor database from the Computational Material Repository to generate novel donor-acceptor oligomers. Six different transfer learning models were developed with different subsets of the donor-acceptor database as training sets. We concluded that electronic properties such as HOMO-LUMO gaps and dipole moments of the training sets can be learned using the SMILES representation with deep generative models, and that the chemical space of the training sets can be efficiently explored. This approach identified approximately 1700 new molecules that have promising electronic properties (HOMO-LUMO gap <2 eV and dipole moment <2 Debye), six times more than in the original database. Among the molecular transformations, the deep generative model has learned how to produce novel molecules by trading off between selected atomic substitutions (such as halogenation or methylation) and molecular features such as the spatial extension of the oligomer. The method can be extended as a plausible source of new chemical combinations to effectively explore the chemical space for targeted properties.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Yersultan Mirasbekov ◽  
Adina Zhumakhanova ◽  
Almira Zhantuyakova ◽  
Kuanysh Sarkytbayev ◽  
Dmitry V. Malashenkov ◽  
...  

A machine learning approach was employed to detect and quantify Microcystis colonial morphospecies using FlowCAM-based imaging flow cytometry. The system was trained and tested using samples from a long-term mesocosm experiment (LMWE, Central Jutland, Denmark). The statistical validation of the classification approaches was performed using Hellinger distances, Bray–Curtis dissimilarity, and Kullback–Leibler divergence. The semi-automatic classification based on well-balanced training sets from the Microcystis seasonal bloom provided a high level of intergeneric accuracy (96–100%) but relatively low intrageneric accuracy (67–78%). Our results provide a proof of concept of how machine learning approaches can be applied to analyze colonial microalgae. This approach allowed us to evaluate the Microcystis seasonal bloom in individual mesocosms with a high level of temporal and spatial resolution. The observation that some Microcystis morphotypes completely disappeared and re-appeared along the mesocosm experiment timeline supports the hypothesis of the main transition pathways of colonial Microcystis morphoforms. We demonstrated that accurate classification of Microcystis spp. required significant changes in the training sets of colonial images between time points that differed by only two weeks, owing to the high phenotypic heterogeneity of Microcystis during the bloom. We conclude that automatic methods not only reach the performance level of a human taxonomist, and can thus be a valuable time-saving tool in the routine identification of colonial phytoplankton taxa, but can also be applied to increase the temporal and spatial resolution of the study.
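The three validation measures named in this abstract can be sketched as follows. This is a minimal illustration using the textbook definitions, not the authors' code: the function names and the idea of comparing per-class counts from manual and automatic classification are assumptions made for the example.

```python
import math


def normalize(counts):
    """Turn non-negative class counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range 0..1)."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative count vectors."""
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        raise ValueError("both vectors are empty or all zero")
    return sum(abs(a - b) for a, b in zip(u, v)) / denominator


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) in nats.

    eps is an assumed floor that avoids division by zero when q has an
    empty class; terms with p == 0 contribute nothing.
    """
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


# Hypothetical per-class counts (made-up numbers, not data from the study).
manual_counts = [40, 35, 25]
automatic_counts = [42, 30, 28]
p = normalize(manual_counts)
q = normalize(automatic_counts)
print(hellinger(p, q), bray_curtis(manual_counts, automatic_counts), kl_divergence(p, q))
```

Hellinger distance and Bray–Curtis dissimilarity are symmetric and bounded by 1, whereas Kullback–Leibler divergence is asymmetric, so the choice of which distribution plays the reference role matters for the last measure.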


Author(s):  
Robin Pla ◽  
Thibaut Ledanois ◽  
Escobar David Simbana ◽  
Anaël Aubry ◽  
Benjamin Tranchard ◽  
...  

The main aim of this study was to evaluate the validity and the reliability of a swimming sensor to assess swimming performance and spatial-temporal variables. Six international male open-water swimmers completed a protocol which consisted of two training sets: a 6 × 100 m individual medley and a continuous 800 m set in freestyle. Swimmers were equipped with a wearable sensor, the TritonWear, to automatically collect spatial-temporal variables: speed, lap time, stroke count (SC), stroke length (SL), stroke rate (SR), and stroke index (SI). Video recordings were added as a “gold standard” and used to assess the validity and the reliability of the TritonWear sensor. The results show that the sensor provides accurate results in comparison with video recording measurements. A very high accuracy was observed for lap time, with a mean absolute percentage error (MAPE) under 5% for each stroke (2.2, 3.2, 3.4 and 4.1% for butterfly, backstroke, breaststroke and freestyle, respectively), but high error ranges indicate a dependence on swimming technique. Stroke count accuracy was higher for symmetric strokes than for alternate strokes (MAPE: 0, 2.4, 7.1 and 4.9% for butterfly, breaststroke, backstroke and freestyle, respectively). The other variables (SL, SR and SI), derived from the SC and the lap time, also show good accuracy in all strokes. The wearable sensor provides accurate real-time feedback on spatial-temporal variables in six international open-water swimmers during classical training sets (at low to moderate intensities), which could be a useful tool for coaches, allowing them to monitor training load with no effort.
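The error metric and the derived variables mentioned in this abstract can be illustrated with a short sketch. The MAPE formula is the standard one; the formulas for speed, SR, SL and SI are common textbook definitions assumed here for illustration, since the abstract does not state how the sensor computes them. All function names and numbers are hypothetical.

```python
def mape(measured, reference):
    """Mean absolute percentage error of sensor values against a reference."""
    if len(measured) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(m - r) / abs(r) for m, r in zip(measured, reference))
    return 100.0 * total / len(reference)


def derive_stroke_variables(lap_length_m, lap_time_s, stroke_count):
    """Derive speed, SR, SL and SI from lap length, lap time and stroke count.

    Assumed definitions (not taken from the paper):
      speed = lap length / lap time              [m/s]
      SR    = 60 * stroke count / lap time       [strokes/min]
      SL    = lap length / stroke count          [m/stroke]
      SI    = speed * SL                         [m^2/s]
    """
    if lap_time_s <= 0 or stroke_count <= 0:
        raise ValueError("lap time and stroke count must be positive")
    speed = lap_length_m / lap_time_s
    stroke_length = lap_length_m / stroke_count
    return {
        "speed": speed,
        "stroke_rate": 60.0 * stroke_count / lap_time_s,
        "stroke_length": stroke_length,
        "stroke_index": speed * stroke_length,
    }


# Hypothetical lap times in seconds (made-up numbers, not data from the study).
sensor_lap_times = [31.2, 32.0, 30.5]
video_lap_times = [30.8, 32.4, 30.9]
print(mape(sensor_lap_times, video_lap_times))
print(derive_stroke_variables(lap_length_m=50.0, lap_time_s=40.0, stroke_count=20))
```

Because SL, SR and SI are all computed from stroke count and lap time under these definitions, any error in either measured quantity propagates to the derived variables.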


2021 ◽  
Vol 40 (1) ◽  
pp. 551-563
Author(s):  
Liqiong Lu ◽  
Dong Wu ◽  
Ziwei Tang ◽  
Yaohua Yi ◽  
Faliang Huang

This paper focuses on script identification in natural scene images. Traditional CNNs (Convolutional Neural Networks) cannot solve this problem perfectly for two reasons. First, scene images have arbitrary aspect ratios, which causes difficulty for traditional CNNs that take a fixed-size image as input. Second, some scripts with minor differences are easily confused because they share a subset of characters with the same shapes. We propose a novel approach combining Score CNN, Attention CNN and patches. Attention CNN is used to determine whether a patch is a discriminative patch and to calculate the contribution weight of the discriminative patch to script identification of the whole image. Score CNN takes a discriminative patch as input and predicts the score of each script type. First, patches of the same size are extracted from the scene images. Second, these patches are used as inputs to Score CNN and Attention CNN to train two patch-level classifiers. Finally, the results of multiple discriminative patches extracted from the same image via the above two classifiers are fused to obtain the script type of this image. Using patches of the same size as inputs to the CNNs avoids the problems caused by arbitrary aspect ratios of scene images. The trained classifiers can mine discriminative patches to accurately identify some confusing scripts. The experimental results show the good performance of our approach on four public datasets.
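The final fusion step described in this abstract can be sketched as a weighted sum of patch-level scores. This is only an illustration under stated assumptions: the abstract does not give the fusion rule, so the weighted sum, the `threshold` used to decide which patches count as discriminative, and the function name `fuse_patch_scores` are all hypothetical. The per-patch scores and weights stand in for the outputs of Score CNN and Attention CNN.

```python
def fuse_patch_scores(patch_scores, patch_weights, labels, threshold=0.5):
    """Fuse per-patch script scores into one image-level prediction.

    patch_scores:  one score vector per patch (stand-in for Score CNN output)
    patch_weights: one contribution weight per patch (stand-in for
                   Attention CNN output)
    threshold:     assumed cutoff below which a patch is treated as
                   non-discriminative and ignored
    """
    if len(patch_scores) != len(patch_weights):
        raise ValueError("need exactly one weight per patch")
    fused = [0.0] * len(labels)
    kept = 0
    for scores, weight in zip(patch_scores, patch_weights):
        if weight < threshold:
            continue  # skip patches judged non-discriminative
        kept += 1
        for i, score in enumerate(scores):
            fused[i] += weight * score
    if kept == 0:
        raise ValueError("no discriminative patches above the threshold")
    best = max(range(len(labels)), key=fused.__getitem__)
    return labels[best], fused


# Hypothetical outputs for three patches of one image (made-up numbers).
labels = ["Latin", "Cyrillic"]
scores = [[0.6, 0.4], [0.2, 0.8], [0.9, 0.1]]
weights = [0.9, 0.3, 0.7]
predicted, fused = fuse_patch_scores(scores, weights, labels)
print(predicted, fused)
```

In this sketch the second patch falls below the threshold and is dropped, so the image-level decision rests only on the patches that received a high contribution weight.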


2021 ◽  
Vol 118 (24) ◽  
pp. 241108
Author(s):  
E. Mejia ◽  
Y. Qian ◽  
S. A. Safiabadi Tali ◽  
J. Song ◽  
W. Zhou
