Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Transformer-based architectures have become the de facto models for a range of Natural Language Processing tasks. In particular, BERT-based models have achieved significant accuracy gains on GLUE tasks, CoNLL-03, and SQuAD. However, BERT-based models have a prohibitive memory footprint and latency, which makes deploying them in resource-constrained environments challenging. In this work, we perform an extensive analysis of fine-tuned BERT models using second-order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra-low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian-based mixed-precision method to compress the model further. We extensively test our proposed method on the BERT downstream tasks SST-2, MNLI, CoNLL-03, and SQuAD. We achieve performance comparable to the baseline, with at most 2.3% degradation, even with ultra-low-precision quantization down to 2 bits, corresponding to up to 13× compression of the model parameters and up to 4× compression of the embedding table as well as activations. Among all tasks, we observed the highest performance loss for BERT fine-tuned on SQuAD. By probing into the Hessian-based analysis as well as visualization, we show that this is related to the fact that the current training/fine-tuning strategy of BERT does not converge for SQuAD.
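
The idea behind group-wise quantization can be illustrated with a minimal sketch. It is not the Q-BERT implementation: the min-max uniform quantizer, the function names, the group size, and the toy weights are all assumptions chosen for illustration. The sketch only shows why giving each group of weights its own quantization range can reduce error at 2 bits compared with one range for the whole tensor.

```python
def quantize_group(values, bits):
    """Min-max uniform quantization of one group (assumed scheme).

    Returns the dequantized values, i.e. each input snapped to one of
    2**bits evenly spaced levels between the group's min and max.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # a constant group needs no rounding
    scale = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_groupwise(weights, group_size, bits):
    """Quantize consecutive groups of `group_size` weights independently."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(quantize_group(weights[start:start + group_size], bits))
    return out


def max_abs_error(original, quantized):
    """Largest absolute difference between original and quantized weights."""
    return max(abs(a - b) for a, b in zip(original, quantized))


# Hypothetical weights: four small values followed by four large ones.
weights = [0.01, -0.02, 0.03, -0.04, 1.0, -2.0, 3.0, -4.0]

per_tensor = quantize_groupwise(weights, group_size=len(weights), bits=2)
per_group = quantize_groupwise(weights, group_size=4, bits=2)

print("per-tensor max error:", max_abs_error(weights, per_tensor))
print("group-wise max error:", max_abs_error(weights, per_group))
```

In this toy case the small weights get their own narrow range, so the group-wise error is lower than the per-tensor error. The price is storing one range per group.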

Generative Pre-Training from Molecules

10.33774/chemrxiv-2021-5fwjd ◽

2021 ◽

Author(s):

Sanjar Adilov

Keyword(s):

Language Processing ◽

State Of The Art ◽

Language Model ◽

Molecular Data ◽

Fine Tuning ◽

Model Parameters ◽

Property Prediction ◽

Machine Learning Methods ◽

Recent Success ◽

Language Construct

SMILES is a line notation for entering and representing molecules. Being inherently a language construct, it allows estimating molecular data in a self-supervised fashion by employing machine learning methods for natural language processing (NLP). The recent success of attention-based neural networks in NLP has made large-corpora transformer pretraining a de facto standard for learning representations and transferring knowledge to downstream tasks. In this work, we attempt to adapt transformer capabilities to a large SMILES corpus by constructing a GPT-2-like language model. We experimentally show that a pretrained causal transformer captures general knowledge that can be successfully transferred to downstream tasks such as focused molecule generation and single-/multi-output molecular-property prediction. For each task, we freeze the model parameters and attach trainable lightweight networks (adapters) between attention blocks as an alternative to fine-tuning. With a relatively modest setup, our transformer outperforms the recently proposed ChemBERTa transformer and approaches state-of-the-art MoleculeNet and Chemprop results. Overall, transformers pretrained on SMILES corpora are promising alternatives that do not require handcrafted feature engineering, make few assumptions about the structure of the data, and scale well with the pretraining data size.

Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database

Applied Sciences ◽

10.3390/app11010024 ◽

2020 ◽

Vol 11 (1) ◽

pp. 24

Author(s):

Jin Tao ◽

Kelly Brayton ◽

Shira Broschat

Keyword(s):

Language Processing ◽

Fine Tuning ◽

Support Vector ◽

Protein Annotation ◽

Computing Power ◽

Journal Publication ◽

Novel Approach ◽

Uniprotkb Database ◽

Public Repositories ◽

Annotation Errors

Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data.
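
As a quick consistency check on the reported metrics, the F1 score is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall quoted in the abstract. The function name is an assumption introduced here, and the check says nothing about how the authors computed their figures.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the ensemble model as reported in the abstract.
ensemble_f1 = f1_score(precision=0.6519, recall=0.9125)
print(f"F1 = {ensemble_f1:.4f}")  # prints "F1 = 0.7605"
```

The recomputed value agrees with the reported F1 of 76.05%.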

Transformers-sklearn: a toolkit for medical language understanding with transformer-based models

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01459-0 ◽

2021 ◽

Vol 21 (S2) ◽

Author(s):

Feihong Yang ◽

Xuwen Wang ◽

Hetong Ma ◽

Jiao Li

Keyword(s):

Language Processing ◽

Pearson Correlation ◽

Fine Tuning ◽

Entity Recognition ◽

Training Dataset ◽

Training Methods ◽

Code Size ◽

Model Framework ◽

Language Understanding ◽

Medical Language

Background: Transformer is an attention-based architecture that has proven to be the state-of-the-art model in natural language processing (NLP). To make it easier to start using transformer-based models for medical language understanding and to expand the capability of the scikit-learn toolkit in deep learning, we propose an easy-to-learn Python toolkit named transformers-sklearn. By wrapping the interfaces of transformers in only three functions (i.e., fit, score, and predict), transformers-sklearn combines the advantages of the transformers and scikit-learn toolkits. Methods: In transformers-sklearn, three Python classes were implemented, namely, BERTologyClassifier for the classification task, BERTologyNERClassifier for the named entity recognition (NER) task, and BERTologyRegressor for the regression task. Each class contains three methods, i.e., fit for fine-tuning transformer-based models with the training dataset, score for evaluating the performance of the fine-tuned model, and predict for predicting the labels of the test dataset. transformers-sklearn is a user-friendly toolkit that (1) is customizable via a few parameters (e.g., model_name_or_path and model_type), (2) supports multilingual NLP tasks, and (3) requires less coding. The input data format is automatically generated by transformers-sklearn from the annotated corpus. Newcomers only need to prepare the dataset. The model framework and training methods are predefined in transformers-sklearn. Results: We collected four open-source medical language datasets: TrialClassification for Chinese medical trial text multi-label classification, BC5CDR for English biomedical text named entity recognition, DiabetesNER for Chinese diabetes entity recognition, and BIOSSES for English biomedical sentence similarity estimation. In the four medical NLP tasks, the average code size of our script is 45 lines/task, which is one-sixth the size of transformers’ script. The experimental results show that transformers-sklearn based on pretrained BERT models achieved macro F1 scores of 0.8225, 0.8703 and 0.6908, respectively, on the TrialClassification, BC5CDR and DiabetesNER tasks and a Pearson correlation of 0.8260 on the BIOSSES task, which is consistent with the results of transformers. Conclusions: The proposed toolkit could help newcomers easily address medical language understanding tasks using the scikit-learn coding style. The code and tutorials of transformers-sklearn are available at https://doi.org/10.5281/zenodo.4453803. In the future, more medical language understanding tasks will be supported to improve the applications of transformers-sklearn.

Multi-level Chunk-based Constituent-to-Dependency Treebank Transformation for Tibetan Dependency Parsing

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3424247 ◽

2021 ◽

Vol 20 (2) ◽

pp. 1-12

Author(s):

Shumin Shi ◽

Dan Luo ◽

Xing Wu ◽

Congjun Long ◽

Heyan Huang

Keyword(s):

Language Processing ◽

Manual Annotation ◽

Syntactic Parsing ◽

Dependency Parsing ◽

Low Resource ◽

Resource Setting ◽

Dependency Tree ◽

Low Resource Setting ◽

Novel Method ◽

Multi Level

Dependency parsing is an important task in Natural Language Processing (NLP). However, a mature parser requires a large treebank for training, which is still extremely costly to create. Tibetan is an extremely low-resource language for NLP: no Tibetan dependency treebank is available, and such treebanks are currently obtained by manual annotation. Furthermore, there is little related research on treebank construction. We propose a novel method of multi-level chunk-based syntactic parsing to complete constituent-to-dependency treebank conversion for Tibetan under scarce conditions. Our method mines more dependencies of Tibetan sentences, builds a high-quality Tibetan dependency tree corpus, and makes fuller use of the inherent laws of the language itself. We train the dependency parsing models on the dependency treebank obtained by the preliminary transformation. The model achieves 86.5% accuracy, 96% LAS, and 97.85% UAS, which exceeds the best results of existing conversion methods. The experimental results show that our method has the potential to be used in a low-resource setting, which means we not only solve the problem of the scarce Tibetan dependency treebank but also avoid needless manual annotation. The method embodies the regularity of strong knowledge-guided linguistic analysis methods, which is of great significance for promoting research on Tibetan information processing.

Two-Stage Mask-RCNN Approach for Detecting and Segmenting the Optic Nerve Head, Optic Disc, and Optic Cup in Fundus Images

Applied Sciences ◽

10.3390/app10113833 ◽

2020 ◽

Vol 10 (11) ◽

pp. 3833 ◽

Cited By ~ 3

Author(s):

Haidar Almubarak ◽

Yakoub Bazi ◽

Naif Alajlan

Keyword(s):

Optic Nerve ◽

Optic Disc ◽

Optic Nerve Head ◽

Fine Tuning ◽

Fundus Images ◽

Two Stage ◽

Second Stage ◽

Retinal Fundus Images ◽

Tuning Strategy ◽

Retinal Fundus

In this paper, we propose a method for localizing the optic nerve head and segmenting the optic disc/cup in retinal fundus images. The approach is based on a simple two-stage Mask-RCNN, in contrast to the sophisticated methods that represent the state of the art in the literature. In the first stage, we detect and crop around the optic nerve head, then feed the cropped image as input to the second stage. The second-stage network is trained using a weighted loss to produce the final segmentation. To further improve the detection in the first stage, we propose a new fine-tuning strategy that combines the cropping output of the first stage with the original training image to train a new detection network using different scales for the region proposal network anchors. We evaluate the method on the Retinal Fundus Images for Glaucoma Analysis (REFUGE), Magrabi, and MESSIDOR datasets. We used the REFUGE training subset to train the models in the proposed method. Our method achieved a mean absolute error of 0.0430 in the vertical cup-to-disc ratio (MAE vCDR) on the REFUGE test set, compared to 0.0414 obtained using complex methods with multiple ensemble networks. The models trained with the proposed method transfer well to datasets outside REFUGE, achieving a MAE vCDR of 0.0785 and 0.077 on the MESSIDOR and Magrabi datasets, respectively, without being retrained. In terms of detection accuracy, the proposed new fine-tuning strategy improved the detection rate from 96.7% to 98.04% on MESSIDOR and from 93.6% to 100% on Magrabi compared to the detection rates reported in the literature.

On-device Prior Knowledge Incorporated Learning for Personalized Atrial Fibrillation Detection

ACM Transactions on Embedded Computing Systems ◽

10.1145/3476987 ◽

2021 ◽

Vol 20 (5s) ◽

pp. 1-25

Author(s):

Zhenge Jia ◽

Yiyu Shi ◽

Samir Saba ◽

Jingtong Hu

Keyword(s):

Atrial Fibrillation ◽

Prior Knowledge ◽

Domain Knowledge ◽

Fine Tuning ◽

Cardiac Monitoring ◽

Patient Specific ◽

Detection Accuracy ◽

Specific Patient ◽

Deep Model ◽

Tuning Strategy

Atrial Fibrillation (AF), one of the most prevalent arrhythmias, is an irregular heart-rate rhythm causing serious health problems such as stroke and heart failure. Deep-learning-based methods have been exploited to provide end-to-end AF detection by automatically extracting features from Electrocardiogram (ECG) signals, and they achieve state-of-the-art results. However, the pre-trained models cannot adapt to each patient’s rhythm due to the high variability of rhythm characteristics among different patients. Furthermore, the deep models are prone to overfitting when fine-tuned on the limited ECG of the specific patient for personalization. In this work, we propose a prior knowledge incorporated learning method to effectively personalize the model for patient-specific AF detection and alleviate the overfitting problems. To be more specific, a prior-incorporated portion importance mechanism is proposed to force the network to learn to focus on the targeted portion of the ECG, following the cardiologists’ domain knowledge in recognizing AF. A prior-incorporated regularization mechanism is further devised to alleviate model overfitting during personalization by regularizing the fine-tuning process with feature priors on typical AF rhythms of the general population. The proposed personalization method embeds the well-defined prior knowledge in diagnosing AF rhythm into the personalization procedure, which improves the personalized deep model and eliminates the workload of manually adjusting parameters in conventional AF detection methods. The prior knowledge incorporated personalization is feasibly and semi-automatically conducted on the edge device of the cardiac monitoring system. We report an average AF detection accuracy of 95.3% of three deep models over patients, surpassing the pre-trained model by a large margin of 11.5% and the fine-tuning strategy by 8.6%.

EventEpi–A Natural Language Processing Framework for Event-Based Surveillance

10.1101/19006395 ◽

2019 ◽

Author(s):

Auss Abbood ◽

Alexander Ullrich ◽

Rüdiger Busche ◽

Stéphane Ghozzi

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Web Application ◽

Fine Tuning ◽

Entity Recognition ◽

World Health ◽

Support Vector ◽

Event Based ◽

Processing Framework

According to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of epidemiologists sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process. To support EBS, but also to gain insights into what makes an article and the event it describes relevant, we developed a natural-language-processing framework for automated information extraction and relevance scoring. First, we scraped relevant sources for EBS as done at RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles’ key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps: EpiTator, an open-source epidemiological annotation tool, suggested many different possibilities for each. We trained a naive Bayes classifier to find the single most likely one using RKI’s EBS database as labels. Then, for relevance scoring, we defined two classes to which any article might belong: The article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers, using document and word embeddings. Two of the tested algorithms stood out: The multilayer perceptron performed best overall, with a precision of 0.19, recall of 0.50, specificity of 0.89, F1 of 0.28, and the highest tested index balanced accuracy of 0.46. The support-vector machine, on the other hand, had the highest recall (0.88), which can be of higher interest for epidemiologists. Finally, we integrated these functionalities into a web application called EventEpi where relevant sources are automatically analyzed and put into a database. The user can also provide any URL or text, which will be analyzed in the same way and added to the database. Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, already works well and can be used in production, promising improvements in EBS. The source code is publicly available at https://github.com/aauss/EventEpi.

Self-Supervised Pre-Training of Transformers for Satellite Image Time Series Classification

10.36227/techrxiv.13025039.v1 ◽

2020 ◽

Author(s):

Yuan Yuan ◽

Lei Lin

Keyword(s):

Time Series ◽

Deep Learning ◽

Large Scale ◽

Temporal Structure ◽

Satellite Image ◽

Fine Tuning ◽

Small Scale ◽

Model Parameters ◽

Learning Approaches ◽

Wide Range

Satellite image time series (SITS) classification is a major research topic in remote sensing and is relevant for a wide range of applications. Deep learning approaches have been commonly employed for SITS classification and have provided state-of-the-art performance. However, deep learning methods suffer from overfitting when labeled data is scarce. To address this problem, we propose a novel self-supervised pre-training scheme to initialize a Transformer-based network by utilizing large-scale unlabeled data. In detail, the model is asked to predict randomly contaminated observations given an entire time series of a pixel. The main idea of our proposal is to leverage the inherent temporal structure of satellite time series to learn general-purpose spectral-temporal representations related to land cover semantics. Once pre-training is completed, the pre-trained network can be further adapted to various SITS classification tasks by fine-tuning all the model parameters on small-scale task-related labeled data. In this way, the general knowledge and representations about SITS can be transferred to a label-scarce task, thereby improving the generalization performance of the model as well as reducing the risk of overfitting. Comprehensive experiments have been carried out on three benchmark datasets over large study areas. Experimental results demonstrate the effectiveness of the proposed method, leading to a classification accuracy increase of 1.91% to 6.69%. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Efficient detection of repeating sites to accelerate phylogenetic likelihood calculations

10.1101/035873 ◽

2016 ◽

Cited By ~ 2

Author(s):

Kassian Kobert ◽

Alexandros Stamatakis ◽

Tomáš Flouri

Keyword(s):

Evolutionary Biology ◽

Likelihood Function ◽

Simulated Data ◽

Evolutionary Model ◽

Identical Result ◽

Model Parameters ◽

Data Sets ◽

Efficient Detection ◽

Novel Method ◽

Computational Bottleneck

The phylogenetic likelihood function is the major computational bottleneck in several applications of evolutionary biology such as phylogenetic inference, species delimitation, model selection and divergence times estimation. Given the alignment, a tree and the evolutionary model parameters, the likelihood function computes the conditional likelihood vectors for every node of the tree. Vector entries for which all input data are identical result in redundant likelihood operations which, in turn, yield identical conditional values. Such operations can be omitted for improving run-time and, using appropriate data structures, reducing memory usage. We present a fast, novel method for identifying and omitting such redundant operations in phylogenetic likelihood calculations, and assess the performance improvement and memory saving attained by our method. Using empirical and simulated data sets, we show that a prototype implementation of our method yields up to 10-fold speedups and uses up to 78% less memory than one of the fastest and most highly tuned implementations of the phylogenetic likelihood function currently available. Our method is generic and can seamlessly be integrated into any phylogenetic likelihood implementation.
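
The notion of a repeating site can be sketched in a few lines. This is a naive illustration and not the authors' algorithm, which the abstract describes as considerably more efficient. The function name, the dictionary-based alignment, and the toy sequences are assumptions. The sketch relies only on the observation stated above: for a given subtree, alignment columns that carry identical characters at all of that subtree's leaves yield identical conditional likelihood entries, so each distinct pattern needs to be computed only once.

```python
def site_classes(alignment, leaves):
    """Assign a class id to every alignment column for one subtree.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    leaves:    taxa below the subtree's root, in a fixed order.
    Columns with the same characters at all `leaves` get the same id, so
    the conditional likelihood entry is needed once per id, not per column.
    """
    n_sites = len(next(iter(alignment.values())))
    seen = {}      # pattern -> class id, in order of first appearance
    classes = []
    for site in range(n_sites):
        pattern = tuple(alignment[taxon][site] for taxon in leaves)
        classes.append(seen.setdefault(pattern, len(seen)))
    return classes


# Hypothetical toy alignment with three taxa and six sites.
alignment = {
    "A": "ACGTAC",
    "B": "ACGTAC",
    "C": "AAGTCC",
}

small_subtree = site_classes(alignment, ["A", "B"])
whole_tree = site_classes(alignment, ["A", "B", "C"])

print(small_subtree)  # [0, 1, 2, 3, 0, 1]: 4 distinct patterns for 6 sites
print(whole_tree)     # [0, 1, 2, 3, 4, 5]: no repeats once C is included
```

As the toy example suggests, repeats are most frequent in small subtrees and become rarer toward the root, where more leaves must agree for two columns to match.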

Lifelong Zero-Shot Learning

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/77 ◽

2020 ◽

Author(s):

Kun Wei ◽

Cheng Deng ◽

Xu Yang

Keyword(s):

Fine Tuning ◽

Training Set ◽

Continuous Training ◽

Evaluation Protocol ◽

Multiple Datasets ◽

Real World Applications ◽

Knowledge Distillation ◽

Novel Method ◽

Previous Training ◽

Current Stage

Zero-Shot Learning (ZSL) handles the problem that some testing classes never appear in the training set. Existing ZSL methods are designed to learn from a fixed training set and cannot capture and accumulate the knowledge of multiple training sets, making them infeasible for many real-world applications. In this paper, we propose a new ZSL setting, named Lifelong Zero-Shot Learning (LZSL), which aims to accumulate knowledge while learning from multiple datasets and to recognize unseen classes of all trained datasets. In addition, we develop a novel method to realize LZSL, which effectively alleviates catastrophic forgetting in the continuous training process. Specifically, because these datasets contain different semantic embeddings, we use a Variational Auto-Encoder to obtain unified semantic representations. Then, we leverage a selective retraining strategy to preserve the trained weights of previous tasks and avoid negative transfer when fine-tuning the entire model. Finally, knowledge distillation is employed to transfer knowledge from previous training stages to the current stage. We also design an LZSL evaluation protocol and challenging benchmarks. Extensive experiments on these benchmarks indicate that our method tackles the LZSL problem effectively, while existing ZSL methods fail.
