Efficient Design Of Peptide-Binding Polymers Using Active Learning Approaches

2021 ◽  
Author(s):  
Assima Rakhimbekova ◽  
Anton Lopukhov ◽  
Natalia L. Klyachko ◽  
Alexander Kabanov ◽  
Timur I. Madzhidov ◽  
...  

Active learning (AL) has become a subject of active recent research both in industry and academia as an efficient approach for the rapid design and discovery of novel chemicals, materials, and polymers. The key advantages of this approach relate to its ability to (i) employ relatively small datasets for model development, (ii) iterate between model development and model assessment using small external datasets that can be either generated in focused experimental studies or formed from subsets of the initial training data, and (iii) progressively evolve models toward increasingly reliable predictions and the identification of novel chemicals with the desired properties. Herein, we first compared various AL protocols for their effectiveness in finding biologically active molecules using synthetic datasets. We investigated the dependence of AL performance on the size of the initial training set, the relative complexity of the task, and the choice of the initial training dataset. We found that AL techniques applied to regression modeling offer no benefit over random search, whereas AL used for classification tasks outperforms models built on randomly selected training sets, though it still falls well short of perfect. Using the best-performing AL protocol, we assessed the applicability of AL to the discovery of polymeric micelle formulations for poorly soluble drugs. Finally, the best-performing AL approach was employed to discover and experimentally validate novel binding polymers in a case study of the asialoglycoprotein receptor (ASGPR).
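The fit-query-label loop behind such AL protocols can be sketched in a few lines. Everything below (the toy 1-D classification task, the threshold "model", the `oracle` function standing in for an experiment) is an invented illustration, not the paper's actual models or assays:

```python
import random

# Minimal pool-based active-learning loop with uncertainty sampling on a
# toy 1-D classification task: the true label is 1 if x > 5.0. The "model"
# is a single threshold placed midway between the innermost labelled
# examples of each class; the query picks the pool point nearest the
# current decision boundary.

def oracle(x):                       # ground-truth labelling (the "assay")
    return 1 if x > 5.0 else 0

def fit_threshold(labelled):         # midpoint between the innermost examples
    pos = [x for x, y in labelled if y == 1]
    neg = [x for x, y in labelled if y == 0]
    return (min(pos) + max(neg)) / 2.0

def most_uncertain(pool, threshold): # instance closest to the boundary
    return min(pool, key=lambda x: abs(x - threshold))

random.seed(0)
pool = [random.uniform(0.0, 10.0) for _ in range(200)]
labelled = [(1.0, 0), (9.0, 1)]      # small initial training set

for _ in range(10):                  # AL iterations: fit, query, label, repeat
    t = fit_threshold(labelled)
    query = most_uncertain(pool, t)
    pool.remove(query)
    labelled.append((query, oracle(query)))

final_threshold = fit_threshold(labelled)
```

After ten queries the estimated boundary sits close to the true value of 5.0, whereas ten randomly chosen labels would typically leave it far less constrained, which is the classification-task advantage the abstract reports.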

2021 ◽  
Vol 69 (4) ◽  
pp. 297-306
Author(s):  
Julius Krause ◽  
Maurice Günder ◽  
Daniel Schulz ◽  
Robin Gruna

Abstract. The selection of training data determines the quality of a chemometric calibration model. In order to cover the entire parameter space of the known influencing parameters, an experimental design is usually created. Nevertheless, even with a carefully prepared Design of Experiment (DoE), redundant reference analyses are often performed during the analysis of agricultural products. Because the number of possible reference analyses is usually very limited, the active learning approaches presented here are intended to provide a tool for better selection of training samples.


Author(s):  
M. Kölle ◽  
V. Walter ◽  
S. Schmohl ◽  
U. Soergel

Abstract. Automated semantic interpretation of 3D point clouds is crucial for many tasks in the domain of geospatial data analysis. For this purpose, labeled training data is required, which often has to be provided manually by experts. One approach to minimize the effort in terms of costs of human interaction is Active Learning (AL). The aim is to process only the subset of an unlabeled dataset that is particularly helpful with respect to class separation. Here, a machine identifies informative instances which are then labeled by humans, thereby increasing the performance of the machine. In order to completely avoid the involvement of an expert, this time-consuming annotation can be resolved via crowdsourcing. Therefore, we propose an approach combining AL with paid crowdsourcing. Although it incorporates human interaction, our method can run fully automatically, so that only an unlabeled dataset and a fixed financial budget for the payment of the crowdworkers need to be provided. We conduct multiple iteration steps of the AL process on the ISPRS Vaihingen 3D Semantic Labeling benchmark dataset (V3D) and especially evaluate the performance of the crowd when labeling 3D points. We prove our concept by using labels derived from our crowd-based AL method for classifying the test dataset. The analysis shows that by having the crowd label only 0.4% of the training dataset and spending less than $145, both our trained Random Forest and sparse 3D CNN classifiers differ in Overall Accuracy by less than 3 percentage points compared to the same classifiers trained on the complete V3D training set.
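The selection step of such a crowd-based AL iteration can be illustrated as follows: rank unlabeled points by the entropy of an ensemble's class votes and forward the most ambiguous ones to crowdworkers, up to the fixed budget. The vote counts, class names, and per-label price below are made up for the sketch and are not the paper's values:

```python
import math

# Rank candidate 3D points by the entropy of per-point class votes (e.g.
# from the trees of a Random Forest) and send the most ambiguous points
# to the crowd, as many as the remaining budget allows.

def vote_entropy(votes):
    total = sum(votes.values())
    probs = [v / total for v in votes.values() if v > 0]
    return -sum(p * math.log(p) for p in probs)

# illustrative per-point class votes from a 100-tree ensemble
candidates = {
    "p1": {"roof": 98, "tree": 1, "ground": 1},    # confident
    "p2": {"roof": 40, "tree": 35, "ground": 25},  # very ambiguous
    "p3": {"roof": 55, "tree": 45, "ground": 0},   # somewhat ambiguous
}

budget_usd, price_per_label = 0.12, 0.06           # invented crowd pricing
n_affordable = int(budget_usd / price_per_label)
ranked = sorted(candidates, key=lambda p: vote_entropy(candidates[p]), reverse=True)
to_crowd = ranked[:n_affordable]
```

With this ranking the confidently classified point `p1` never costs crowd money, which is how the budgeted loop keeps the labeled fraction so small.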


Author(s):  
Shaolei Wang ◽  
Zhongyuan Wang ◽  
Wanxiang Che ◽  
Sendong Zhao ◽  
Ting Liu

Spoken language is fundamentally different from written language in that it contains frequent disfluencies, i.e., parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for use in downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle the training data bottleneck, in this work we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) sentence classification to distinguish original sentences from grammatically incorrect sentences. We then combine these two tasks to jointly pre-train a neural network. The pre-trained network is subsequently fine-tuned using human-annotated disfluency detection training data. The self-supervised learning method can capture task-specific knowledge for disfluency detection and achieves better performance when fine-tuned on a small annotated dataset than other supervised methods. However, because the pseudo training data are generated from simple heuristics and cannot fully cover all disfluency patterns, a performance gap remains relative to supervised models trained on the full training dataset. We further explore how to bridge this gap by integrating active learning into the fine-tuning process. Active learning strives to reduce annotation costs by choosing the most critical examples to label, and can thus address the weakness of self-supervised learning with a small annotated dataset.
We show that by combining self-supervised learning with active learning, our model is able to match state-of-the-art performance with just about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
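The pseudo-data construction described above (randomly adding or deleting words, with per-token tags marking the added noise) can be sketched as follows. The filler vocabulary and corruption probabilities are illustrative assumptions, not the paper's settings:

```python
import random

# Corrupt a clean sentence by randomly inserting or deleting words, keeping
# a per-token tag: 1 for an inserted "noise" word, 0 for an original word.
# The tags drive the tagging pre-training task; whether the sentence was
# changed at all drives the sentence-classification task.

random.seed(1)
FILLERS = ["uh", "um", "well", "you", "know"]  # illustrative insertion vocabulary

def corrupt(tokens, p_insert=0.2, p_delete=0.1):
    out, tags = [], []
    for tok in tokens:
        if random.random() < p_insert:       # insert a noisy word before tok
            out.append(random.choice(FILLERS))
            tags.append(1)
        if random.random() < p_delete:       # drop the original word
            continue
        out.append(tok)
        tags.append(0)
    return out, tags

sent = "i want a flight to boston".split()
noisy, tags = corrupt(sent)
# tagging task target: `tags`; classification target: original vs. corrupted
is_corrupted = int(noisy != sent)
```

Because both targets come for free from the corruption procedure, arbitrarily large pseudo-datasets can be generated from unlabeled text, which is exactly why the heuristic (rather than its coverage of real disfluency patterns) sets the ceiling the abstract mentions.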


2021 ◽  
Author(s):  
Rudy Venguswamy ◽  
Mike Levy ◽  
Anirudh Koul ◽  
Satyarth Praveen ◽  
Tarun Narayanan ◽  
...  

<p>Machine learning modeling for Earth events at NASA is often limited by the availability of labeled examples. For example, training classifiers for forest fires or oil spills from satellite imagery requires curating a massive and diverse dataset of example forest fires, a tedious multi-month effort requiring careful review of over 196.9 million square miles of data per day for 20 years. While such images might exist in abundance within 40 petabytes of unlabeled satellite data, finding these positive examples to include in a training dataset for a machine learning model is extremely time-consuming and requires researchers to "hunt" for positive examples, like finding a needle in a haystack. </p><p>We present a no-code open-source tool, Curator, whose goal is to minimize the amount of manual image labeling needed to achieve a state-of-the-art classifier. The pipeline, purpose-built to take advantage of the massive amount of unlabeled images, consists of (1) self-supervised training to convert unlabeled images into meaningful representations, (2) search-by-example to collect a seed set of images, and (3) human-in-the-loop active learning to iteratively ask for labels on uncertain examples and train on them. </p><p>In step 1, a model capable of representing unlabeled images meaningfully is trained with a self-supervised algorithm (such as SimCLR) on a random subset of the dataset (one that conforms to researchers’ specified “training budget”). Since real-world datasets are often imbalanced, leading to suboptimal models, the initial model is used to generate embeddings on the entire dataset, and images with equidistant embeddings are then sampled. This iterative training and resampling strategy improves both the balance of the training data and the model at every iteration. 
In step 2, researchers supply an example image of interest, and the embedding generated from this image is used to find other images whose embeddings lie near the reference image’s embedding in Euclidean space (hence images that look similar to the query image). These proposed candidate images contain a higher density of positive examples and are annotated manually as a seed set. In step 3, the seed labels are used to train a classifier that identifies more candidate images for human inspection via active learning. In each classification training loop, candidate images for labeling are sampled from the larger unlabeled dataset based on the images the model is most uncertain about (p ≈ 0.5).</p><p>Curator is released as an open-source package built on PyTorch-Lightning. The pipeline uses GPU-based transforms from the NVIDIA DALI package for augmentation, leading to a 5-10x speed-up in self-supervised training, and is run from the command line.</p><p>By iteratively training a self-supervised model and a classifier in tandem with human annotation, this pipeline is able to unearth more positive examples from severely imbalanced datasets which were previously untrainable with self-supervision algorithms. In applications such as detecting wildfires or atmospheric dust, or turning outward with telescopic surveys, increasing the number of positive candidates presented to humans for manual inspection increases the efficacy of classifiers and multiplies the efficiency of researchers’ data curation efforts.</p>
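Step 2's search-by-example reduces to a nearest-neighbour query in embedding space. A minimal sketch with tiny made-up 2-D embeddings follows; in Curator the vectors would come from the self-supervised encoder of step 1:

```python
import math

# Rank unlabeled images by Euclidean distance between their embeddings and
# the embedding of a researcher-supplied query image, and return the
# nearest ones as the candidate seed set. All names and vectors below are
# invented for illustration.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

embeddings = {
    "img_fire_1": [0.9, 0.1], "img_fire_2": [0.8, 0.2],
    "img_cloud":  [0.1, 0.9], "img_ocean":  [0.0, 0.8],
}
query = [0.85, 0.15]                 # embedding of the example "fire" image

seed_candidates = sorted(embeddings, key=lambda k: euclidean(query, embeddings[k]))[:2]
```

Because similar-looking images cluster in embedding space, the returned candidates are far denser in positives than a random draw from the 40-petabyte archive, which is what makes manual seed annotation tractable.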


Author(s):  
Farshid Rahmani ◽  
Chaopeng Shen ◽  
Samantha Oliver ◽  
Kathryn Lawson ◽  
Alison Appling

Basin-centric long short-term memory (LSTM) network models have recently been shown to be an exceptionally powerful tool for simulating stream temperature (Ts, temperature measured in rivers), among other hydrological variables. However, spatial extrapolation is a well-known challenge in modeling Ts, and it is uncertain how an LSTM-based daily Ts model will perform in unmonitored or dammed basins. Here we compiled a new benchmark dataset consisting of >400 basins across the contiguous United States (CONUS) in different data availability groups (DAGs, defined by daily sampling frequency), with or without major dams, and studied how to assemble suitable training datasets for predictions in monitored or unmonitored situations. For temporal generalization, the CONUS-median best root-mean-square error (RMSE) values for sites with extensive (99%), intermediate (60%), scarce (10%), and absent (0%, unmonitored) training data were 0.75, 0.83, 0.88, and 1.59°C, respectively, representing the state of the art. For prediction in unmonitored basins (PUB), LSTM’s results surpassed those reported in the literature. Even for unmonitored basins with major reservoirs, we obtained a median RMSE of 1.492°C and an R2 of 0.966. The most suitable training set was the matching DAG that the basin could be grouped into, e.g., the 60% DAG for a basin with 61% data availability; for PUB, however, a training dataset including all basins with data is preferred. An input-selection ensemble moderately mitigated attribute overfitting. Our results suggest there are influential latent processes not sufficiently described by the inputs (e.g., geology, wetland cover), but temporal fluctuations are well predictable, and LSTM appears to be the more accurate Ts modeling tool when sufficient training data are available.


2014 ◽  
Vol 11 (2) ◽  
pp. 665-678 ◽  
Author(s):  
Stefanos Ougiaroglou ◽  
Georgios Evangelidis

Data reduction techniques improve the efficiency of k-Nearest Neighbour classification on large datasets, since they accelerate the classification process and reduce storage requirements for the training data. IB2 is an effective prototype selection data reduction technique. It selects some items from the initial training dataset and uses them as representatives (prototypes). Contrary to many other techniques, IB2 is a very fast, one-pass method that builds its reduced (condensing) set in an incremental manner. New training data can update the condensing set without the need of the "old" removed items. This paper proposes a variation of IB2 that generates new prototypes instead of selecting them. The variation is called AIB2 and attempts to improve the efficiency of IB2 by positioning the prototypes in the center of the data areas they represent. The empirical experimental study conducted in the present work, as well as the Wilcoxon signed ranks test, show that AIB2 performs better than IB2.
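A minimal sketch of the AIB2 idea on toy 1-D data: one pass over the training stream as in IB2, but while a misclassified item still becomes a new prototype, a correctly classified item now shifts its nearest prototype toward the running mean of the items it represents. The distance measure and data are simplified illustrations, not the paper's exact formulation:

```python
# AIB2-style one-pass condensing on 1-D points with string labels.
# Each prototype tracks how many items it represents ("n") so its position
# can be updated as an incremental mean.

def nearest(prototypes, x):
    return min(prototypes, key=lambda p: abs(p["pos"] - x))

def aib2(stream):
    first_x, first_y = stream[0]
    prototypes = [{"pos": float(first_x), "label": first_y, "n": 1}]
    for x, y in stream[1:]:
        p = nearest(prototypes, x)
        if p["label"] != y:                  # misclassified: becomes a new prototype
            prototypes.append({"pos": float(x), "label": y, "n": 1})
        else:                                # correct: drift prototype toward the
            p["n"] += 1                      # running mean of its items
            p["pos"] += (x - p["pos"]) / p["n"]
    return prototypes

data = [(1.0, "a"), (2.0, "a"), (9.0, "b"), (8.0, "b"), (1.5, "a")]
protos = aib2(data)
```

On this stream the condensing set keeps only two prototypes, each sitting at the mean of its cluster rather than at an arbitrary early-seen item, which is the centering effect that motivates AIB2.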


Author(s):  
N. Li ◽  
N. Pfeifer

<p><strong>Abstract.</strong> Training dataset generation is a difficult and expensive task for LiDAR point classification, especially in the case of large-area classification. We present a method to automatically extend a small set of training data by label propagation processing. The class labels could be correctly extended to their optimal neighbourhood, and the most informative points are selected and added into the training set. With the final extended training dataset, the overall accuracy (OA) of the classification could be increased by about 2%. We also show that this approach is stable regardless of the number of initial training points, and achieves larger improvements especially when starting with an extremely small initial training set.</p>
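The label-propagation step can be illustrated on a toy 1-D "point cloud": a label spreads only into unambiguous neighbourhoods within a fixed radius. The radius and the single-label (unambiguity) rule are simplifications invented for this sketch; the actual method operates on LiDAR point features:

```python
# Extend a small labeled set: an unlabeled point inherits a label only when
# every labeled point within the radius agrees on that label; ambiguous or
# empty neighbourhoods are left unlabeled.

def propagate(labeled, unlabeled, radius=1.0):
    extended = list(labeled)
    for x in unlabeled:
        votes = {lab for px, lab in labeled if abs(px - x) <= radius}
        if len(votes) == 1:              # unambiguous neighbourhood only
            extended.append((x, votes.pop()))
    return extended

labeled = [(0.0, "ground"), (10.0, "roof")]
unlabeled = [0.5, 0.9, 5.0, 9.6]         # 5.0 lies near neither labeled point
training = propagate(labeled, unlabeled)
```

The point at 5.0 stays unlabeled because no labeled neighbour reaches it, mirroring how propagation only extends labels where the evidence is locally consistent.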


2021 ◽  
Vol 23 (06) ◽  
pp. 10-22
Author(s):  
Ms. Anshika Shukla ◽  
◽  
Mr. Sanjeev Kumar Shukla ◽  

In recent years, various methods for source code classification using deep learning approaches have been proposed. The classification accuracy of methods using deep learning is greatly influenced by the training dataset; it is therefore possible to create a model with higher accuracy by improving the way the training dataset is constructed. In this study, we propose a dynamic training dataset improvement method for source code classification using deep learning. In the proposed method, we first train and validate the source code classification model using the training dataset. Next, we reconstruct the training dataset based on the validation result. We create a high-accuracy model by repeating this training and reconstruction, thereby improving the training dataset. In the evaluation experiment, a source code classification model was trained using the proposed method, and its classification accuracy was compared with three baseline methods. The model trained with the proposed method achieved the highest classification accuracy. We also confirmed that the proposed method improves the classification accuracy of the model from 0.64 to 0.96.


Author(s):  
Dimos Makris ◽  
Ioannis Karydis ◽  
Spyros Sioutas

Automatic melodic harmonization tackles the assignment of harmony content (musical chords) over a given melody. Probabilistic approaches to melodic harmonization utilize statistical information derived from a training dataset, producing harmonies that encapsulate some harmonic characteristics of that dataset; the training data are usually annotated symbolic musical notation. Beyond the obvious musicological interest, different machine learning approaches and algorithms have been proposed for this task, thus strengthening the challenge of efficient and effective music information utilization in probabilistic systems. Consequently, the aim of this chapter is to provide an overview of this research domain as well as to shed light on the subtasks that have arisen and since evolved. Finally, new trends and future directions are discussed, along with the challenges that still remain unsolved.
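A toy count-based harmonizer in the spirit of the probabilistic approaches surveyed here: estimate P(chord | melody note) from annotated note-chord pairs, then harmonize by picking the most frequent chord per note. The tiny "corpus" is invented for illustration, and real systems condition on far richer context than a single note:

```python
from collections import Counter, defaultdict

# Count note-chord co-occurrences in an annotated training corpus, then
# assign each melody note its most frequently co-occurring chord.

training = [("C", "Cmaj"), ("E", "Cmaj"), ("G", "Cmaj"),
            ("D", "Gmaj"), ("G", "Gmaj"), ("F", "Fmaj"),
            ("A", "Fmaj"), ("C", "Fmaj"), ("C", "Cmaj")]

counts = defaultdict(Counter)
for note, chord in training:
    counts[note][chord] += 1

def harmonize(melody):
    # most_common(1) returns the single highest-count (chord, count) pair
    return [counts[n].most_common(1)[0][0] for n in melody]

harmony = harmonize(["C", "E", "D"])
```

The produced harmony necessarily reflects the chord statistics of the training corpus, which is precisely the "encapsulates some harmonic characteristics of the training dataset" property the chapter describes.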

