supervised learning
Recently Published Documents

Shaolei Wang ◽  
Zhongyuan Wang ◽  
Wanxiang Che ◽  
Sendong Zhao ◽  
Ting Liu

Spoken language differs fundamentally from written language in that it contains frequent disfluencies, that is, parts of an utterance that the speaker corrects. Disfluency detection (removing these disfluencies) is desirable to clean the input for downstream NLP tasks. Most existing approaches to disfluency detection rely heavily on human-annotated data, which is scarce and expensive to obtain in practice. To tackle this training data bottleneck, we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) a sentence classification task to distinguish original sentences from grammatically incorrect sentences. We then combine these two tasks to jointly pre-train a neural network, which is subsequently fine-tuned on human-annotated disfluency detection training data. The self-supervised learning method can capture task-specific knowledge for disfluency detection and achieves better performance than other supervised methods when fine-tuned on a small annotated dataset. However, because the pseudo training data are generated with simple heuristics and cannot cover all disfluency patterns, a performance gap remains compared with supervised models trained on the full training dataset. We further explore how to bridge this gap by integrating active learning into the fine-tuning process. Active learning strives to reduce annotation costs by choosing the most critical examples to label and can address the weakness of self-supervised learning with a small annotated dataset. We show that by combining self-supervised learning with active learning, our model matches state-of-the-art performance with only about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.
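The pseudo-data construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the noise probabilities `p_add` and `p_del`, and the filler vocabulary are all hypothetical choices made for illustration. Inserted words are tagged 1 for the tagging task, and a sentence with deleted words serves as a likely ungrammatical negative for the sentence classification task.

```python
import random

def add_noise(tokens, vocab, rng, p_add=0.2):
    """Insert random words; tag 1 marks an inserted word, tag 0 an original one."""
    noisy, tags = [], []
    for tok in tokens:
        if rng.random() < p_add:  # p_add is an assumed noise rate
            noisy.append(rng.choice(vocab))
            tags.append(1)
        noisy.append(tok)
        tags.append(0)
    return noisy, tags

def delete_noise(tokens, rng, p_del=0.2):
    """Drop at least one word (when possible) to get a likely ungrammatical sentence."""
    n_drop = max(1, round(len(tokens) * p_del))
    n_drop = min(n_drop, len(tokens) - 1)  # always keep at least one word
    drop = set(rng.sample(range(len(tokens)), n_drop))
    return [t for i, t in enumerate(tokens) if i not in drop]

def make_pretraining_examples(sentence, vocab, seed=0):
    """Build one example per pre-training task from a single unlabeled sentence."""
    rng = random.Random(seed)
    tokens = sentence.split()
    noisy, tags = add_noise(tokens, vocab, rng)
    corrupted = delete_noise(tokens, rng)
    return {
        # Task (i): token-level tagging of the added words.
        "tagging": (noisy, tags),
        # Task (ii): label 1 = original sentence, label 0 = corrupted sentence.
        "classification": [(tokens, 1), (corrupted, 0)],
    }

if __name__ == "__main__":
    fillers = ["uh", "the", "well", "you"]  # hypothetical noise vocabulary
    print(make_pretraining_examples("i want a flight to boston on monday", fillers))
```

In a joint pre-training setup, both example types would feed a shared encoder with two output heads; that part depends on the model and is left out here.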

2022 ◽  
Vol 16 (2) ◽  
pp. 1-27
Yang Yang ◽  
Hongchen Wei ◽  
Zhen-Qiang Sun ◽  
Guang-Yu Li ◽  
Yuanchun Zhou ◽  

Open set classification (OSC) tackles the problem of determining whether data are in-class or out-of-class during inference when only a set of in-class examples is available at training time. Traditional OSC methods usually train discriminative or generative models on the available in-class data and then use the pre-trained models to classify test data directly. However, these methods suffer from the embedding confusion problem, i.e., some out-of-class instances are mixed with in-class ones of similar semantics, which makes them difficult to classify. To solve this problem, we incorporate semi-supervised learning to develop a novel OSC algorithm, S2OSC, which combines out-of-class instance filtering and model re-training in a transductive manner. In detail, given a pool of newly arriving test data, S2OSC first filters the most distinct out-of-class instances using the pre-trained model and annotates them with a super-class. Then, S2OSC trains a holistic classification model by combining the in-class and out-of-class labeled data with the remaining unlabeled test data in a semi-supervised paradigm. Furthermore, considering that data usually arrive as a stream in real applications, we extend S2OSC into an incremental update framework (I-S2OSC) and adopt a knowledge memory regularization to mitigate catastrophic forgetting during incremental updates. Despite the simplicity of the proposed models, the experimental results show that S2OSC achieves state-of-the-art performance across a variety of OSC tasks, including an F1 score of 85.4% on CIFAR-10 with only 300 pseudo-labels. We also demonstrate how S2OSC can be extended effectively to the incremental OSC setting with streaming data.

2022 ◽  
Vol 237 ◽  
pp. 111561
Chiwu Bu ◽  
Tao Liu ◽  
Rui Li ◽  
Runhong Shen ◽  
Bo Zhao ◽  

2022 ◽  
Vol 11 (1) ◽  
pp. 325-337
Natalia Gil ◽  
Marcelo Albuquerque ◽  
Gabriela de

The article aims to develop a machine-learning algorithm that can predict students' graduation in the Industrial Engineering course at the Federal University of Amazonas based on their performance data. The methodology uses a dataset of 364 students admitted between 2007 and 2019 and considers characteristics that can directly or indirectly affect each student's graduation: type of high school, number of semesters taken, grade-point average, lockouts, dropouts and course terminations. The data treatment involved the manual removal of several characteristics that did not add value to the output of the algorithm, resulting in a dataset composed of 2184 instances. The logistic regression, MLP and XGBoost models developed and compared predict a binary output of graduation or non-graduation for each student, using 30% of the dataset for testing and 70% for training. This made it possible to identify a relationship between the six attributes explored and to achieve, with the best model, an accuracy of 94.15%.
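The evaluation protocol in this abstract (a 70%/30% train/test split, a binary classifier, and accuracy on the held-out part) can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic toy rows rather than the study's student records, the two features are hypothetical stand-ins, and the plain gradient-descent logistic regression is not the authors' model or configuration.

```python
import math
import random

def make_toy_rows(n, seed=0):
    """Synthetic (features, label) rows; NOT the study's data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.uniform(0, 1), rng.uniform(0, 1)]  # two hypothetical scaled features
        rows.append((x, 1 if x[0] + x[1] > 1 else 0))
    return rows

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle and hold out a fraction of the rows for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fit_logistic(train, n_features, lr=0.1, epochs=200):
    """Logistic regression fitted with stochastic gradient descent on log loss."""
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, y in train:
            err = predict_proba((w, b), x) - y  # gradient of log loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def accuracy(model, rows):
    """Fraction of rows whose thresholded prediction matches the label."""
    hits = sum((predict_proba(model, x) >= 0.5) == bool(y) for x, y in rows)
    return hits / len(rows)

if __name__ == "__main__":
    train, test = train_test_split(make_toy_rows(100), test_fraction=0.3)
    model = fit_logistic(train, n_features=2)
    print(f"held-out accuracy: {accuracy(model, test):.2%}")
```

Keeping the test rows out of training is what makes the reported accuracy an estimate of performance on unseen students rather than a measure of fit.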

Tilman Krokotsch ◽  
Mirko Knaak ◽  
Clemens Gühmann

Remaining useful life (RUL) estimation plays a vital role in effectively scheduling maintenance operations. Unfortunately, it suffers from a severe data imbalance in which data from machines near their end of life is rare. Additionally, the data produced by a machine can only be labeled after the machine has failed. Both of these points make data-driven methods for RUL estimation difficult to use. Semi-Supervised Learning (SSL) can incorporate the unlabeled data produced by machines that have not yet failed into data-driven methods. Previous work on SSL evaluated approaches under unrealistic conditions in which the data near failure was still available. Even so, only moderate improvements were achieved. This paper defines more realistic evaluation conditions and proposes a novel SSL approach based on self-supervised pre-training. The method can outperform two competing approaches from the literature and the supervised baseline on the NASA Commercial Modular Aero-Propulsion System Simulation dataset.

2022 ◽  
Binglin Xie ◽  
Xianhua Yao ◽  
Weining Mao ◽  
Mohammad Rafiei ◽  
Nan Hu

Modern AI-assisted approaches have helped material scientists revolutionize their ability to understand the properties of materials. However, current machine learning (ML) models perform poorly for materials with a lengthy production window and a complex testing procedure, because only a limited amount of data can be produced to feed the model. Here, we introduce self-supervised learning (SSL) to address the lack of labeled data in material characterization. We propose a generalized SSL-based framework with domain knowledge and demonstrate its robustness in predicting the properties of a candidate material with the fewest data. Our numerical results show that the proposed SSL model can match the performance of the commonly used supervised learning (SL) model with only 5% of the data, and the SSL model has also proven easy to implement. Our study paves the way to further expand the usability of ML tools for the broader material science community.
