CausaLM: Causal Model Explanation Through Counterfactual Language Models

Abstract Understanding predictions made by deep neural networks is notoriously difficult, but also crucial to their dissemination. As all ML-based methods, they are as good as their training data, and can also capture unwanted biases. While there are tools that can help understand whether such biases exist, they do not distinguish between correlation and causation, and might be ill-suited for text-based models and for reasoning about high level language concepts. A key problem of estimating the causal effect of a concept of interest on a given model is that this estimation requires the generation of counterfactual examples, which is challenging with existing generation technology. To bridge that gap, we propose CausaLM, a framework for producing causal model explanations using counterfactual language representation models. Our approach is based on fine-tuning of deep contextualized embedding models with auxiliary adversarial tasks derived from the causal graph of the problem. Concretely, we show that by carefully choosing auxiliary adversarial pre-training tasks, language representation models such as BERT can effectively learn a counterfactual representation for a given concept of interest, and be used to estimate its true causal effect on model performance. A byproduct of our method is a language representation model that is unaffected by the tested concept, which can be useful in mitigating unwanted bias ingrained in the data.

Download Full-text

Analysis and Evaluation of Language Models for Word Sense Disambiguation

Computational Linguistics ◽

10.1162/coli_a_00405 ◽

2021 ◽

pp. 1-55

Author(s):

Daniel Loureiro ◽

Kiamehr Rezaee ◽

Mohammad Taher Pilehvar ◽

Jose Camacho-Collados

Keyword(s):

Feature Extraction ◽

Word Sense Disambiguation ◽

Language Model ◽

Training Data ◽

Fine Tuning ◽

Language Models ◽

Coarse Grained ◽

Word Sense ◽

Sense Disambiguation ◽

High Level

Abstract Transformer-based language models have taken many fields in NLP by storm. BERT and its derivatives dominate most of the existing evaluation benchmarks, including those for Word Sense Disambiguation (WSD), thanks to their ability in capturing context-sensitive semantic nuances. However, there is still little knowledge about their capabilities and potential limitations in encoding and recovering word senses. In this article, we provide an in-depth quantitative and qualitative analysis of the celebrated BERT model with respect to lexical ambiguity. One of the main conclusions of our analysis is that BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense. Our analysis also reveals that in some cases language models come close to solving coarse-grained noun disambiguation under ideal conditions in terms of availability of training data and computing resources. However, this scenario rarely occurs in real-world settings and, hence, many practical challenges remain even in the coarse-grained setting. We also perform an in-depth comparison of the two main language model based WSD strategies, i.e., fine-tuning and feature extraction, finding that the latter approach is more robust with respect to sense bias and it can better exploit limited available training data. In fact, the simple feature extraction strategy of averaging contextualized embeddings proves robust even using only three training sentences per word sense, with minimal improvements obtained by increasing the size of this training data.

Download Full-text

Closed-Loop Memory GAN for Continual Learning

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/462 ◽

2019 ◽

Cited By ~ 1

Author(s):

Amanda Rios ◽

Laurent Itti

Keyword(s):

Gradient Descent ◽

Closed Loop ◽

Model Performance ◽

Generative Models ◽

Memory Storage ◽

Training Data ◽

Fine Tuning ◽

Long Term Memory ◽

Memory Unit ◽

Dynamic Memory

Sequential learning of tasks using gradient descent leads to an unremitting decline in the accuracy of tasks for which training data is no longer available, termed catastrophic forgetting. Generative models have been explored as a means to approximate the distribution of old tasks and bypass storage of real data. Here we propose a cumulative closed-loop memory replay GAN (CloGAN) provided with external regularization by a small memory unit selected for maximum sample diversity. We evaluate incremental class learning using a notoriously hard paradigm, single-headed learning, in which each task is a disjoint subset of classes in the overall dataset, and performance is evaluated on all previous classes. First, we show that when constructing a dynamic memory unit to preserve sample heterogeneity, model performance asymptotically approaches training on the full dataset. We then show that using a stochastic generator to continuously output fresh new images during training increases performance significantly further meanwhile generating quality images. We compare our approach to several baselines including fine-tuning by gradient descent (FGD), Elastic Weight Consolidation (EWC), Deep Generative Replay (DGR) and Memory Replay GAN (MeRGAN). Our method has very low long-term memory cost, the memory unit, as well as negligible intermediate memory storage.

Download Full-text

Mitigating Class-Boundary Label Uncertainty to Reduce Both Model Bias and Variance

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3429447 ◽

2021 ◽

Vol 15 (2) ◽

pp. 1-18

Author(s):

Matthew Almeida ◽

Yong Zhuang ◽

Wei Ding ◽

Scott E. Crouter ◽

Ping Chen

Keyword(s):

Learning Algorithm ◽

Model Performance ◽

Training Data ◽

Fine Tuning ◽

Classification Model ◽

Model Complexity ◽

Decision Boundary ◽

Model Bias ◽

New Approach ◽

Unseen Data

The study of model bias and variance with respect to decision boundaries is critically important in supervised learning and artificial intelligence. There is generally a tradeoff between the two, as fine-tuning of the decision boundary of a classification model to accommodate more boundary training samples (i.e., higher model complexity) may improve training accuracy (i.e., lower bias) but hurt generalization against unseen data (i.e., higher variance). By focusing on just classification boundary fine-tuning and model complexity, it is difficult to reduce both bias and variance. To overcome this dilemma, we take a different perspective and investigate a new approach to handle inaccuracy and uncertainty in the training data labels, which are inevitable in many applications where labels are conceptual entities and labeling is performed by human annotators. The process of classification can be undermined by uncertainty in the labels of the training data; extending a boundary to accommodate an inaccurately labeled point will increase both bias and variance. Our novel method can reduce both bias and variance by estimating the pointwise label uncertainty of the training set and accordingly adjusting the training sample weights such that those samples with high uncertainty are weighted down and those with low uncertainty are weighted up. In this way, uncertain samples have a smaller contribution to the objective function of the model’s learning algorithm and exert less pull on the decision boundary. In a real-world physical activity recognition case study, the data present many labeling challenges, and we show that this new approach improves model performance and reduces model variance.

Download Full-text

Effect of data-augmentation on fine-tuned CNN model performance

IAES International Journal of Artificial Intelligence (IJ-AI) ◽

10.11591/ijai.v10.i1.pp84-92 ◽

2021 ◽

Vol 10 (1) ◽

pp. 84

Author(s):

Ramaprasad Poojary ◽

Roma Raina ◽

Amit Kumar Mondal

Keyword(s):

Neural Network ◽

Computer Vision ◽

Deep Learning ◽

High Performance ◽

Data Augmentation ◽

Model Performance ◽

Training Data ◽

Fine Tuning ◽

Test Accuracy ◽

Training Time

<span id="docs-internal-guid-cdb76bbb-7fff-978d-961c-e21c41807064"><span>During the last few years, deep learning achieved remarkable results in the field of machine learning when used for computer vision tasks. Among many of its architectures, deep neural network-based architecture known as convolutional neural networks are recently used widely for image detection and classification. Although it is a great tool for computer vision tasks, it demands a large amount of training data to yield high performance. In this paper, the data augmentation method is proposed to overcome the challenges faced due to a lack of insufficient training data. To analyze the effect of data augmentation, the proposed method uses two convolutional neural network architectures. To minimize the training time without compromising accuracy, models are built by fine-tuning pre-trained networks VGG16 and ResNet50. To evaluate the performance of the models, loss functions and accuracies are used. Proposed models are constructed using Keras deep learning framework and models are trained on a custom dataset created from Kaggle CAT vs DOG database. Experimental results showed that both the models achieved better test accuracy when data augmentation is employed, and model constructed using ResNet50 outperformed VGG16 based model with a test accuracy of 90% with data augmentation & 82% without data augmentation.</span></span>

Download Full-text

Transfer Learning Based Traffic Sign Recognition Using Inception-v3 Model

Periodica Polytechnica Transportation Engineering ◽

10.3311/pptr.11480 ◽

2018 ◽

Vol 47 (3) ◽

pp. 242-250 ◽

Cited By ~ 3

Author(s):

Chunmian Lin ◽

Lin Li ◽

Wenting Luo ◽

Kelvin C. P. Wang ◽

Jiangang Guo

Keyword(s):

Transfer Learning ◽

Recognition Performance ◽

Processing Technique ◽

Learning Rate ◽

Training Data ◽

Fine Tuning ◽

Traffic Sign Recognition ◽

Traffic Sign ◽

Sign Recognition ◽

High Level

Traffic sign recognition is critical for advanced driver assistant system and road infrastructure survey. Traditional traffic sign recognition algorithms can't efficiently recognize traffic signs due to its limitation, yet deep learning-based technique requires huge amount of training data before its use, which is time consuming and labor intensive. In this study, transfer learning-based method is introduced for traffic sign recognition and classification, which significantly reduces the amount of training data and alleviates computation expense using Inception-v3 model. In our experiment, Belgium Traffic Sign Database is chosen and augmented by data pre-processing technique. Subsequently the layer-wise features extracted using different convolution and pooling operations are compared and analyzed. Finally transfer learning-based model is repetitively retrained several times with fine-tuning parameters at different learning rate, and excellent reliability and repeatability are observed based on statistical analysis. The results show that transfer learning model can achieve a high-level recognition performance in traffic sign recognition, which is up to 99.18 % of recognition accuracy at 0.05 learning rate (average accuracy of 99.09 %). This study would be beneficial in other traffic infrastructure recognition such as road lane marking and roadside protection facilities, and so on.

Download Full-text

Deep Learning-Based Differentiation between Mucinous Cystic Neoplasm and Serous Cystic Neoplasm in the Pancreas Using Endoscopic Ultrasonography

Diagnostics ◽

10.3390/diagnostics11061052 ◽

2021 ◽

Vol 11 (6) ◽

pp. 1052

Author(s):

Leang Sim Nguon ◽

Kangwon Seo ◽

Jung-Hyun Lim ◽

Tae-Jun Song ◽

Sung-Hyun Cho ◽

...

Keyword(s):

Decision Making ◽

Deep Learning ◽

Network Model ◽

Endoscopic Ultrasonography ◽

Data Augmentation ◽

Clinical Information ◽

Training Data ◽

Fine Tuning ◽

Cystic Neoplasm ◽

Cystic Neoplasms

Mucinous cystic neoplasms (MCN) and serous cystic neoplasms (SCN) account for a large portion of solitary pancreatic cystic neoplasms (PCN). In this study we implemented a convolutional neural network (CNN) model using ResNet50 to differentiate between MCN and SCN. The training data were collected retrospectively from 59 MCN and 49 SCN patients from two different hospitals. Data augmentation was used to enhance the size and quality of training datasets. Fine-tuning training approaches were utilized by adopting the pre-trained model from transfer learning while training selected layers. Testing of the network was conducted by varying the endoscopic ultrasonography (EUS) image sizes and positions to evaluate the network performance for differentiation. The proposed network model achieved up to 82.75% accuracy and a 0.88 (95% CI: 0.817–0.930) area under curve (AUC) score. The performance of the implemented deep learning networks in decision-making using only EUS images is comparable to that of traditional manual decision-making using EUS images along with supporting clinical information. Gradient-weighted class activation mapping (Grad-CAM) confirmed that the network model learned the features from the cyst region accurately. This study proves the feasibility of diagnosing MCN and SCN using a deep learning network model. Further improvement using more datasets is needed.

Download Full-text

Genome-Wide Metabolic Reconstruction of the Synthesis of Polyhydroxyalkanoates from Sugars and Fatty Acids by Burkholderia Sensu Lato Species

Microorganisms ◽

10.3390/microorganisms9061290 ◽

2021 ◽

Vol 9 (6) ◽

pp. 1290

Author(s):

Natalia Alvarez-Santullano ◽

Pamela Villegas ◽

Mario Sepúlveda Mardones ◽

Roberto E. Durán ◽

Raúl Donoso ◽

...

Keyword(s):

Fatty Acids ◽

De Novo ◽

Reducing Power ◽

Pentose Phosphate ◽

Fine Tuning ◽

Metabolic Reconstruction ◽

Phylogenetic Groups ◽

Outlier Group ◽

Pha Genes ◽

High Level

Burkholderia sensu lato (s.l.) species have a versatile metabolism. The aims of this review are the genomic reconstruction of the metabolic pathways involved in the synthesis of polyhydroxyalkanoates (PHAs) by Burkholderia s.l. genera, and the characterization of the PHA synthases and the pha genes organization. The reports of the PHA synthesis from different substrates by Burkholderia s.l. strains were reviewed. Genome-guided metabolic reconstruction involving the conversion of sugars and fatty acids into PHAs by 37 Burkholderia s.l. species was performed. Sugars are metabolized via the Entner–Doudoroff (ED), pentose-phosphate (PP), and lower Embden–Meyerhoff–Parnas (EMP) pathways, which produce reducing power through NAD(P)H synthesis and PHA precursors. Fatty acid substrates are metabolized via β-oxidation and de novo synthesis of fatty acids into PHAs. The analysis of 194 Burkholderia s.l. genomes revealed that all strains have the phaC, phaA, and phaB genes for PHA synthesis, wherein the phaC gene is generally present in ≥2 copies. PHA synthases were classified into four phylogenetic groups belonging to class I II and III PHA synthases and one outlier group. The reconstruction of PHAs synthesis revealed a high level of gene redundancy probably reflecting complex regulatory layers that provide fine tuning according to diverse substrates and physiological conditions.

Download Full-text

New polyp image classification technique using transfer learning of network-in-network structure in endoscopic images

Scientific Reports ◽

10.1038/s41598-021-83199-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Young Jae Kim ◽

Jang Pyo Bae ◽

Jun-Won Chung ◽

Dong Kyun Park ◽

Kwang Gi Kim ◽

...

Keyword(s):

Colorectal Cancer ◽

Transfer Learning ◽

Test Data ◽

State Of The Art ◽

Early Stage ◽

Statistical Significance ◽

Recall Rate ◽

Training Data ◽

Fine Tuning ◽

Accuracy Evaluation

AbstractWhile colorectal cancer is known to occur in the gastrointestinal tract. It is the third most common form of cancer of 27 major types of cancer in South Korea and worldwide. Colorectal polyps are known to increase the potential of developing colorectal cancer. Detected polyps need to be resected to reduce the risk of developing cancer. This research improved the performance of polyp classification through the fine-tuning of Network-in-Network (NIN) after applying a pre-trained model of the ImageNet database. Random shuffling is performed 20 times on 1000 colonoscopy images. Each set of data are divided into 800 images of training data and 200 images of test data. An accuracy evaluation is performed on 200 images of test data in 20 experiments. Three compared methods were constructed from AlexNet by transferring the weights trained by three different state-of-the-art databases. A normal AlexNet based method without transfer learning was also compared. The accuracy of the proposed method was higher in statistical significance than the accuracy of four other state-of-the-art methods, and showed an 18.9% improvement over the normal AlexNet based method. The area under the curve was approximately 0.930 ± 0.020, and the recall rate was 0.929 ± 0.029. An automatic algorithm can assist endoscopists in identifying polyps that are adenomatous by considering a high recall rate and accuracy. This system can enable the timely resection of polyps at an early stage.

Download Full-text

Performance Evaluation of Deep CNN-Based Crack Detection and Localization Techniques for Concrete Structures

Sensors ◽

10.3390/s21051688 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1688

Author(s):

Luqman Ali ◽

Fady Alnajjar ◽

Hamad Al Jassmi ◽

Munkhjargal Gochoo ◽

Wasif Khan ◽

...

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Crack Detection ◽

Concrete Structures ◽

Model Performance ◽

Training Data ◽

Computational Time ◽

Data Heterogeneity ◽

Public Datasets ◽

Detection And Localization

This paper proposes a customized convolutional neural network for crack detection in concrete structures. The proposed method is compared to four existing deep learning methods based on training data size, data heterogeneity, network complexity, and the number of epochs. The performance of the proposed convolutional neural network (CNN) model is evaluated and compared to pretrained networks, i.e., the VGG-16, VGG-19, ResNet-50, and Inception V3 models, on eight datasets of different sizes, created from two public datasets. For each model, the evaluation considered computational time, crack localization results, and classification measures, e.g., accuracy, precision, recall, and F1-score. Experimental results demonstrated that training data size and heterogeneity among data samples significantly affect model performance. All models demonstrated promising performance on a limited number of diverse training data; however, increasing the training data size and reducing diversity reduced generalization performance, and led to overfitting. The proposed customized CNN and VGG-16 models outperformed the other methods in terms of classification, localization, and computational time on a small amount of data, and the results indicate that these two models demonstrate superior crack detection and localization for concrete structures.

Download Full-text

Train Fast While Reducing False Positives: Improving Animal Classification Performance Using Convolutional Neural Networks

Geomatics ◽

10.3390/geomatics1010004 ◽

2021 ◽

Vol 1 (1) ◽

pp. 34-49

Author(s):

Mael Moreni ◽

Jerome Theau ◽

Samuel Foucher

Keyword(s):

False Positive ◽

Classification Performance ◽

Large Datasets ◽

False Positives ◽

Fine Tuning ◽

Training Time ◽

In The Wild ◽

Test Sets ◽

High Level ◽

Time Decrease

The combination of unmanned aerial vehicles (UAV) with deep learning models has the capacity to replace manned aircrafts for wildlife surveys. However, the scarcity of animals in the wild often leads to highly unbalanced, large datasets for which even a good detection method can return a large amount of false detections. Our objectives in this paper were to design a training method that would reduce training time, decrease the number of false positives and alleviate the fine-tuning effort of an image classifier in a context of animal surveys. We acquired two highly unbalanced datasets of deer images with a UAV and trained a Resnet-18 classifier using hard-negative mining and a series of recent techniques. Our method achieved sub-decimal false positive rates on two test sets (1 false positive per 19,162 and 213,312 negatives respectively), while training on small but relevant fractions of the data. The resulting training times were therefore significantly shorter than they would have been using the whole datasets. This high level of efficiency was achieved with little tuning effort and using simple techniques. We believe this parsimonious approach to dealing with highly unbalanced, large datasets could be particularly useful to projects with either limited resources or extremely large datasets.

Download Full-text