Efficient Design Of Peptide-Binding Polymers Using Active Learning Approaches

2021 ◽  
Author(s):  
Assima Rakhimbekova ◽  
Anton Lopukhov ◽  
Natalia L. Klyachko ◽  
Alexander Kabanov ◽  
Timur I. Madzhidov ◽  
...  

Active learning (AL) has become a subject of active recent research both in industry and academia as an efficient approach for the rapid design and discovery of novel chemicals, materials, and polymers. The key advantages of this approach relate to its ability to (i) employ relatively small datasets for model development, (ii) iterate between model development and model assessment using small external datasets that can be either generated in focused experimental studies or formed from subsets of the initial training data, and (iii) progressively evolve models toward increasingly reliable predictions and the identification of novel chemicals with the desired properties. Herein, we first compared various AL protocols for their effectiveness in finding biologically active molecules using synthetic datasets. We investigated the dependence of AL performance on the size of the initial training set, the relative complexity of the task, and the choice of the initial training dataset. We found that AL techniques applied to regression modeling offer no benefit over random search, whereas AL used for classification tasks outperforms models built on randomly selected training sets, though it still falls well short of perfect. Using the best-performing AL protocol, we assessed the applicability of AL to the discovery of polymeric micelle formulations for poorly soluble drugs. Finally, the best-performing AL approach was employed to discover and experimentally validate novel binding polymers in a case study of the asialoglycoprotein receptor (ASGPR).
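The fit-query-label loop behind such AL protocols can be sketched in a few lines. Everything below (the toy 1-D classification task, the threshold "model", the `oracle` function standing in for an experiment) is an invented illustration, not the paper's actual models or assays:

```python
import random

# Minimal pool-based active-learning loop with uncertainty sampling on a
# toy 1-D classification task: the true label is 1 if x > 5.0. The "model"
# is a single threshold placed midway between the innermost labelled
# examples of each class; the query picks the pool point nearest the
# current decision boundary.

def oracle(x):                       # ground-truth labelling (the "assay")
    return 1 if x > 5.0 else 0

def fit_threshold(labelled):         # midpoint between the innermost examples
    pos = [x for x, y in labelled if y == 1]
    neg = [x for x, y in labelled if y == 0]
    return (min(pos) + max(neg)) / 2.0

def most_uncertain(pool, threshold): # instance closest to the boundary
    return min(pool, key=lambda x: abs(x - threshold))

random.seed(0)
pool = [random.uniform(0.0, 10.0) for _ in range(200)]
labelled = [(1.0, 0), (9.0, 1)]      # small initial training set

for _ in range(10):                  # AL iterations: fit, query, label, repeat
    t = fit_threshold(labelled)
    query = most_uncertain(pool, t)
    pool.remove(query)
    labelled.append((query, oracle(query)))

final_threshold = fit_threshold(labelled)
```

After ten queries the estimated boundary sits close to the true value of 5.0, whereas ten randomly chosen labels would typically leave it far less constrained, which is the classification-task advantage the abstract reports.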

2021 ◽  
Vol 69 (4) ◽  
pp. 297-306
Author(s):  
Julius Krause ◽  
Maurice Günder ◽  
Daniel Schulz ◽  
Robin Gruna

Abstract. The selection of training data determines the quality of a chemometric calibration model. In order to cover the entire parameter space of the known influencing parameters, an experimental design is usually created. Nevertheless, even with a carefully prepared Design of Experiment (DoE), redundant reference analyses are often performed during the analysis of agricultural products. Because the number of possible reference analyses is usually very limited, the active learning approaches presented here are intended to provide a tool for better selection of training samples.


Author(s):  
M. Kölle ◽  
V. Walter ◽  
S. Schmohl ◽  
U. Soergel

Abstract. Automated semantic interpretation of 3D point clouds is crucial for many tasks in the domain of geospatial data analysis. For this purpose, labeled training data is required, which often has to be provided manually by experts. One approach to minimize the effort in terms of costs of human interaction is Active Learning (AL). The aim is to process only the subset of an unlabeled dataset that is particularly helpful with respect to class separation. Here, a machine identifies informative instances which are then labeled by humans, thereby increasing the performance of the machine. In order to completely avoid the involvement of an expert, this time-consuming annotation can be resolved via crowdsourcing. Therefore, we propose an approach combining AL with paid crowdsourcing. Although it incorporates human interaction, our method can run fully automatically, so that only an unlabeled dataset and a fixed financial budget for the payment of the crowdworkers need to be provided. We conduct multiple iteration steps of the AL process on the ISPRS Vaihingen 3D Semantic Labeling benchmark dataset (V3D) and especially evaluate the performance of the crowd when labeling 3D points. We prove our concept by using labels derived from our crowd-based AL method for classifying the test dataset. The analysis shows that by having the crowd label only 0.4% of the training dataset and spending less than $145, both our trained Random Forest and sparse 3D CNN classifiers differ in Overall Accuracy by less than 3 percentage points compared to the same classifiers trained on the complete V3D training set.
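The selection step of such a crowd-based AL iteration can be illustrated as follows: rank unlabeled points by the entropy of an ensemble's class votes and forward the most ambiguous ones to crowdworkers, up to the fixed budget. The vote counts, class names, and per-label price below are made up for the sketch and are not the paper's values:

```python
import math

# Rank candidate 3D points by the entropy of per-point class votes (e.g.
# from the trees of a Random Forest) and send the most ambiguous points
# to the crowd, as many as the remaining budget allows.

def vote_entropy(votes):
    total = sum(votes.values())
    probs = [v / total for v in votes.values() if v > 0]
    return -sum(p * math.log(p) for p in probs)

# illustrative per-point class votes from a 100-tree ensemble
candidates = {
    "p1": {"roof": 98, "tree": 1, "ground": 1},    # confident
    "p2": {"roof": 40, "tree": 35, "ground": 25},  # very ambiguous
    "p3": {"roof": 55, "tree": 45, "ground": 0},   # somewhat ambiguous
}

budget_usd, price_per_label = 0.12, 0.06           # invented crowd pricing
n_affordable = int(budget_usd / price_per_label)
ranked = sorted(candidates, key=lambda p: vote_entropy(candidates[p]), reverse=True)
to_crowd = ranked[:n_affordable]
```

With this ranking the confidently classified point `p1` never costs crowd money, which is how the budgeted loop keeps the labeled fraction so small.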


Author(s):  
Shaolei Wang ◽  
Zhongyuan Wang ◽  
Wanxiang Che ◽  
Sendong Zhao ◽  
Ting Liu

Spoken language is fundamentally different from written language in that it contains frequent disfluencies, i.e., parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for use in downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle the training data bottleneck, in this work we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) sentence classification to distinguish original sentences from grammatically incorrect sentences. We then combine these two tasks to jointly pre-train a neural network. The pre-trained network is subsequently fine-tuned using human-annotated disfluency detection training data. The self-supervised learning method can capture task-specific knowledge for disfluency detection and achieves better performance when fine-tuned on a small annotated dataset than other supervised methods. However, because the pseudo training data are generated from simple heuristics and cannot fully cover all disfluency patterns, a performance gap remains relative to supervised models trained on the full training dataset. We further explore how to bridge this gap by integrating active learning into the fine-tuning process. Active learning strives to reduce annotation costs by choosing the most critical examples to label, and can thus address the weakness of self-supervised learning with a small annotated dataset.
We show that by combining self-supervised learning with active learning, our model is able to match state-of-the-art performance with just about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
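The pseudo-data construction described above (randomly adding or deleting words, with per-token tags marking the added noise) can be sketched as follows. The filler vocabulary and corruption probabilities are illustrative assumptions, not the paper's settings:

```python
import random

# Corrupt a clean sentence by randomly inserting or deleting words, keeping
# a per-token tag: 1 for an inserted "noise" word, 0 for an original word.
# The tags drive the tagging pre-training task; whether the sentence was
# changed at all drives the sentence-classification task.

random.seed(1)
FILLERS = ["uh", "um", "well", "you", "know"]  # illustrative insertion vocabulary

def corrupt(tokens, p_insert=0.2, p_delete=0.1):
    out, tags = [], []
    for tok in tokens:
        if random.random() < p_insert:       # insert a noisy word before tok
            out.append(random.choice(FILLERS))
            tags.append(1)
        if random.random() < p_delete:       # drop the original word
            continue
        out.append(tok)
        tags.append(0)
    return out, tags

sent = "i want a flight to boston".split()
noisy, tags = corrupt(sent)
# tagging task target: `tags`; classification target: original vs. corrupted
is_corrupted = int(noisy != sent)
```

Because both targets come for free from the corruption procedure, arbitrarily large pseudo-datasets can be generated from unlabeled text, which is exactly why the heuristic (rather than its coverage of real disfluency patterns) sets the ceiling the abstract mentions.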


2021 ◽  
Author(s):  
Rudy Venguswamy ◽  
Mike Levy ◽  
Anirudh Koul ◽  
Satyarth Praveen ◽  
Tarun Narayanan ◽  
...  

<p>Machine learning modeling for Earth events at NASA is often limited by the availability of labeled examples. For example, training classifiers for forest fires or oil spills from satellite imagery requires curating a massive and diverse dataset of example forest fires, a tedious multi-month effort requiring careful review of over 196.9 million square miles of data per day for 20 years. While such images might exist in abundance within 40 petabytes of unlabeled satellite data, finding these positive examples to include in a training dataset for a machine learning model is extremely time-consuming and requires researchers to "hunt" for positive examples, like finding a needle in a haystack. </p><p>We present a no-code open-source tool, Curator, whose goal is to minimize the amount of manual image labeling needed to achieve a state-of-the-art classifier. The pipeline, purpose-built to take advantage of the massive amount of unlabeled images, consists of (1) self-supervised training to convert unlabeled images into meaningful representations, (2) search-by-example to collect a seed set of images, and (3) human-in-the-loop active learning to iteratively ask for labels on uncertain examples and train on them. </p><p>In step 1, a model capable of representing unlabeled images meaningfully is trained with a self-supervised algorithm (such as SimCLR) on a random subset of the dataset (one that conforms to researchers’ specified “training budget”). Since real-world datasets are often imbalanced, leading to suboptimal models, the initial model is used to generate embeddings on the entire dataset, and images with equidistant embeddings are then sampled. This iterative training and resampling strategy improves both the balance of the training data and the model at every iteration. 
In step 2, researchers supply an example image of interest, and the embedding generated from this image is used to find other images whose embeddings lie near the reference image’s embedding in Euclidean space (hence images that look similar to the query image). These proposed candidate images contain a higher density of positive examples and are annotated manually as a seed set. In step 3, the seed labels are used to train a classifier that identifies more candidate images for human inspection via active learning. In each classification training loop, candidate images for labeling are sampled from the larger unlabeled dataset based on the images the model is most uncertain about (p ≈ 0.5).</p><p>Curator is released as an open-source package built on PyTorch-Lightning. The pipeline uses GPU-based transforms from the NVIDIA DALI package for augmentation, leading to a 5-10x speed-up in self-supervised training, and is run from the command line.</p><p>By iteratively training a self-supervised model and a classifier in tandem with human annotation, this pipeline is able to unearth more positive examples from severely imbalanced datasets which were previously untrainable with self-supervision algorithms. In applications such as detecting wildfires or atmospheric dust, or turning outward with telescopic surveys, increasing the number of positive candidates presented to humans for manual inspection increases the efficacy of classifiers and multiplies the efficiency of researchers’ data curation efforts.</p>
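Step 2's search-by-example reduces to a nearest-neighbour query in embedding space. A minimal sketch with tiny made-up 2-D embeddings follows; in Curator the vectors would come from the self-supervised encoder of step 1:

```python
import math

# Rank unlabeled images by Euclidean distance between their embeddings and
# the embedding of a researcher-supplied query image, and return the
# nearest ones as the candidate seed set. All names and vectors below are
# invented for illustration.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

embeddings = {
    "img_fire_1": [0.9, 0.1], "img_fire_2": [0.8, 0.2],
    "img_cloud":  [0.1, 0.9], "img_ocean":  [0.0, 0.8],
}
query = [0.85, 0.15]                 # embedding of the example "fire" image

seed_candidates = sorted(embeddings, key=lambda k: euclidean(query, embeddings[k]))[:2]
```

Because similar-looking images cluster in embedding space, the returned candidates are far denser in positives than a random draw from the 40-petabyte archive, which is what makes manual seed annotation tractable.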


Author(s):  
Farshid Rahmani ◽  
Chaopeng Shen ◽  
Samantha Oliver ◽  
Kathryn Lawson ◽  
Alison Appling

Basin-centric long short-term memory (LSTM) network models have recently been shown to be an exceptionally powerful tool for simulating stream temperature (Ts, temperature measured in rivers), among other hydrological variables. However, spatial extrapolation is a well-known challenge in modeling Ts, and it is uncertain how an LSTM-based daily Ts model will perform in unmonitored or dammed basins. Here we compiled a new benchmark dataset consisting of >400 basins across the contiguous United States (CONUS) in different data availability groups (DAGs, defined by daily sampling frequency), with or without major dams, and studied how to assemble suitable training datasets for predictions in monitored or unmonitored situations. For temporal generalization, the CONUS-median best root-mean-square error (RMSE) values for sites with extensive (99%), intermediate (60%), scarce (10%), and absent (0%, unmonitored) training data were 0.75, 0.83, 0.88, and 1.59°C, respectively, representing the state of the art. For prediction in unmonitored basins (PUB), LSTM’s results surpassed those reported in the literature. Even for unmonitored basins with major reservoirs, we obtained a median RMSE of 1.492°C and an R2 of 0.966. The most suitable training set was the matching DAG that the basin could be grouped into, e.g., the 60% DAG for a basin with 61% data availability; for PUB, however, a training dataset including all basins with data is preferred. An input-selection ensemble moderately mitigated attribute overfitting. Our results suggest there are influential latent processes not sufficiently described by the inputs (e.g., geology, wetland cover), but temporal fluctuations are well predictable, and LSTM appears to be the more accurate Ts modeling tool when sufficient training data are available.


2014 ◽  
Vol 11 (2) ◽  
pp. 665-678 ◽  
Author(s):  
Stefanos Ougiaroglou ◽  
Georgios Evangelidis

Data reduction techniques improve the efficiency of k-Nearest Neighbour classification on large datasets, since they accelerate the classification process and reduce storage requirements for the training data. IB2 is an effective prototype selection data reduction technique. It selects some items from the initial training dataset and uses them as representatives (prototypes). Contrary to many other techniques, IB2 is a very fast, one-pass method that builds its reduced (condensing) set in an incremental manner. New training data can update the condensing set without the need of the "old" removed items. This paper proposes a variation of IB2 that generates new prototypes instead of selecting them. The variation is called AIB2 and attempts to improve the efficiency of IB2 by positioning the prototypes in the center of the data areas they represent. The empirical experimental study conducted in the present work, as well as the Wilcoxon signed ranks test, show that AIB2 performs better than IB2.
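A minimal sketch of the AIB2 idea on toy 1-D data: one pass over the training stream as in IB2, but while a misclassified item still becomes a new prototype, a correctly classified item now shifts its nearest prototype toward the running mean of the items it represents. The distance measure and data are simplified illustrations, not the paper's exact formulation:

```python
# AIB2-style one-pass condensing on 1-D points with string labels.
# Each prototype tracks how many items it represents ("n") so its position
# can be updated as an incremental mean.

def nearest(prototypes, x):
    return min(prototypes, key=lambda p: abs(p["pos"] - x))

def aib2(stream):
    first_x, first_y = stream[0]
    prototypes = [{"pos": float(first_x), "label": first_y, "n": 1}]
    for x, y in stream[1:]:
        p = nearest(prototypes, x)
        if p["label"] != y:                  # misclassified: becomes a new prototype
            prototypes.append({"pos": float(x), "label": y, "n": 1})
        else:                                # correct: drift prototype toward the
            p["n"] += 1                      # running mean of its items
            p["pos"] += (x - p["pos"]) / p["n"]
    return prototypes

data = [(1.0, "a"), (2.0, "a"), (9.0, "b"), (8.0, "b"), (1.5, "a")]
protos = aib2(data)
```

On this stream the condensing set keeps only two prototypes, each sitting at the mean of its cluster rather than at an arbitrary early-seen item, which is the centering effect that motivates AIB2.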


Author(s):  
N. Li ◽  
N. Pfeifer

<p><strong>Abstract.</strong> Training dataset generation is a difficult and expensive task for LiDAR point classification, especially in the case of large-area classification. We present a method to automatically extend a small set of training data by label propagation processing. The class labels could be correctly extended to their optimal neighbourhood, and the most informative points are selected and added into the training set. With the final extended training dataset, the overall accuracy (OA) of the classification could be increased by about 2%. We also show that this approach is stable regardless of the number of initial training points, and achieves larger improvements especially when starting with an extremely small initial training set.</p>
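The label-propagation step can be illustrated on a toy 1-D "point cloud": a label spreads only into unambiguous neighbourhoods within a fixed radius. The radius and the single-label (unambiguity) rule are simplifications invented for this sketch; the actual method operates on LiDAR point features:

```python
# Extend a small labeled set: an unlabeled point inherits a label only when
# every labeled point within the radius agrees on that label; ambiguous or
# empty neighbourhoods are left unlabeled.

def propagate(labeled, unlabeled, radius=1.0):
    extended = list(labeled)
    for x in unlabeled:
        votes = {lab for px, lab in labeled if abs(px - x) <= radius}
        if len(votes) == 1:              # unambiguous neighbourhood only
            extended.append((x, votes.pop()))
    return extended

labeled = [(0.0, "ground"), (10.0, "roof")]
unlabeled = [0.5, 0.9, 5.0, 9.6]         # 5.0 lies near neither labeled point
training = propagate(labeled, unlabeled)
```

The point at 5.0 stays unlabeled because no labeled neighbour reaches it, mirroring how propagation only extends labels where the evidence is locally consistent.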


2021 ◽  
Vol 23 (06) ◽  
pp. 10-22
Author(s):  
Ms. Anshika Shukla ◽  
◽  
Mr. Sanjeev Kumar Shukla ◽  

In recent years, various methods for source code classification using deep learning approaches have been proposed. The classification accuracy of methods using deep learning is greatly influenced by the training dataset; it is therefore possible to create a model with higher accuracy by improving the way the training dataset is constructed. In this study, we propose a dynamic training dataset improvement method for source code classification using deep learning. In the proposed method, we first train and validate the source code classification model using the training dataset. Next, we reconstruct the training dataset based on the validation result. We create a high-accuracy model by repeating this training and reconstruction, thereby improving the training dataset. In the evaluation experiment, a source code classification model was trained using the proposed method, and its classification accuracy was compared with three baseline methods. The model trained with the proposed method achieved the highest classification accuracy. We also confirmed that the proposed method improves the classification accuracy of the model from 0.64 to 0.96.


Author(s):  
Dimos Makris ◽  
Ioannis Karydis ◽  
Spyros Sioutas

Automatic melodic harmonization tackles the assignment of harmony content (musical chords) over a given melody. Probabilistic approaches to melodic harmonization utilize statistical information derived from a training dataset, producing harmonies that encapsulate some harmonic characteristics of that dataset; the training data are usually annotated symbolic musical notation. Beyond the obvious musicological interest, different machine learning approaches and algorithms have been proposed for this task, thus strengthening the challenge of efficient and effective music information utilization in probabilistic systems. Consequently, the aim of this chapter is to provide an overview of this research domain as well as to shed light on the subtasks that have arisen and since evolved. Finally, new trends and future directions are discussed, along with the challenges that still remain unsolved.
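A toy count-based harmonizer in the spirit of the probabilistic approaches surveyed here: estimate P(chord | melody note) from annotated note-chord pairs, then harmonize by picking the most frequent chord per note. The tiny "corpus" is invented for illustration, and real systems condition on far richer context than a single note:

```python
from collections import Counter, defaultdict

# Count note-chord co-occurrences in an annotated training corpus, then
# assign each melody note its most frequently co-occurring chord.

training = [("C", "Cmaj"), ("E", "Cmaj"), ("G", "Cmaj"),
            ("D", "Gmaj"), ("G", "Gmaj"), ("F", "Fmaj"),
            ("A", "Fmaj"), ("C", "Fmaj"), ("C", "Cmaj")]

counts = defaultdict(Counter)
for note, chord in training:
    counts[note][chord] += 1

def harmonize(melody):
    # most_common(1) returns the single highest-count (chord, count) pair
    return [counts[n].most_common(1)[0][0] for n in melody]

harmony = harmonize(["C", "E", "D"])
```

The produced harmony necessarily reflects the chord statistics of the training corpus, which is precisely the "encapsulates some harmonic characteristics of the training dataset" property the chapter describes.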

