Fourier-transform-based attribution priors improve the interpretability and stability of deep learning models for genomics

Abstract: Deep learning models can accurately map genomic DNA sequences to associated functional molecular readouts such as protein–DNA binding data. Base-resolution importance (i.e. “attribution”) scores inferred from these models can highlight predictive sequence motifs and syntax. Unfortunately, these models are prone to overfitting and are sensitive to random initializations, often resulting in noisy and irreproducible attributions that obscure the underlying motifs. To address these shortcomings, we propose a novel attribution prior, in which the Fourier transform of input-level attribution scores is computed at training time and high-frequency components of the Fourier spectrum are penalized. We evaluate different model architectures with and without attribution priors, trained on genome-wide binary or continuous molecular profiles. We show that our attribution prior dramatically improves models’ stability, interpretability, and performance on held-out data, especially when training data is severely limited. Our attribution prior also allows models to identify biologically meaningful sequence motifs more sensitively and precisely within individual regulatory elements. The prior is agnostic to the model architecture and the predicted experimental assay, yet provides similar gains across all experiments. This work represents an important advance in improving the reliability of deep learning models for deciphering the regulatory code of the genome.
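
The idea of penalizing high-frequency components of an attribution track can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the `cutoff` parameter, and the choice of "fraction of non-DC spectral magnitude" as the penalty are assumptions made for illustration. A naive DFT is used here for self-containment; actual training would presumably rely on a differentiable FFT inside the deep learning framework.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform at non-negative frequencies."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(coeff))
    return mags


def high_frequency_penalty(attributions, cutoff):
    """Fraction of non-DC spectral magnitude at frequency indices >= cutoff.

    `cutoff` (>= 1) is a hypothetical hyperparameter; the abstract does not
    specify how the high-frequency band is defined.
    """
    mags = dft_magnitudes(attributions)[1:]  # drop the DC (mean) term
    total = sum(mags)
    if total < 1e-9:  # flat track: nothing to penalize
        return 0.0
    # mags[i] holds frequency index i + 1, so frequency `cutoff` sits at cutoff - 1
    return sum(mags[cutoff - 1:]) / total


# A smooth, motif-like bump versus a track that flips at every base
smooth = [math.exp(-((i - 16) ** 2) / 18.0) for i in range(32)]
noisy = [1.0 if i % 2 == 0 else 0.0 for i in range(32)]
print(high_frequency_penalty(smooth, 8))  # close to 0
print(high_frequency_penalty(noisy, 8))   # close to 1
```

In this sketch the penalty would be added to the usual prediction loss with some weight, so that models producing smooth, motif-like attributions are favored over models producing base-to-base noise.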

Domain randomization-enhanced deep learning models for bird detection

Scientific Reports, 2021, Vol 11 (1). DOI: 10.1038/s41598-020-80101-x

Author(s): Xin Mao, Jun Kang Chow, Pin Siang Tan, Kuan-fu Liu, Jimmy Wu, ...

Keyword(s): Deep Learning, Continuous Monitoring, Bird Species, Training Data, Learning Models, Fine Grained, Bird Detection, Relationship Of, The Relationship

Abstract: Automatic bird detection in ornithological analyses is limited by the accuracy of existing models, owing to the lack of training data and the difficulty of extracting the fine-grained features required to distinguish bird species. Here we apply a domain randomization strategy to improve the accuracy of deep learning models for bird detection. Trained on virtual birds with sufficient variation across different environments, the model tends to focus on the fine-grained features of birds and achieves higher accuracy. Based on 100 terabytes of continuous monitoring data on egrets collected over two months, our results encompass findings obtained by conventional manual observation, e.g., the vertical stratification of egrets according to body size, and also open up opportunities for long-term bird surveys that require intensive monitoring impractical with conventional methods, e.g., the influence of weather on egrets and the relationship between the migration schedules of great egrets and little egrets.

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Bioinformatics, 2021. DOI: 10.1093/bioinformatics/btab083

Author(s): Yanrong Ji, Zhihan Zhou, Han Liu, Ramana V Davuluri

Keyword(s): DNA Sequences, Regulatory Elements, Ease Of Use, Fine Tuning, Supplementary Information, Sequence Motifs, Semantic Relationship, Accurate Identification, Conserved Sequence, Genome Wide

Abstract

Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex owing to polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios.

Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT with the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on the prediction of promoters, splice sites and transcription factor binding sites after easy fine-tuning on small task-specific labeled datasets. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences, for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks.

Availability and implementation: The source code and the pre-trained and fine-tuned models for DNABERT are available on GitHub (https://github.com/jerryji1993/DNABERT).

Supplementary information: Supplementary data are available at Bioinformatics online.

U-Infuse: Democratization of Customizable Deep Learning for Object Detection

Sensors, 2021, Vol 21 (8), pp. 2611. DOI: 10.3390/s21082611

Author(s): Andrew Shepley, Greg Falzon, Christopher Lawson, Paul Meek, Paul Kwan

Keyword(s): Deep Learning, Intellectual Property, Object Detection, Image Data, Learning Technologies, Training Data, Learning Models, Ecological Data, Single Class, Large Numbers

Image data is one of the primary sources of ecological data used in biodiversity conservation and management worldwide. However, classifying and interpreting large numbers of images is costly in time and resources, particularly in the context of camera trapping. Deep learning models have been used for this task but are often not suited to specific applications because they generalise poorly to new environments and perform inconsistently. Models need to be developed for specific species cohorts and environments, but the technical skills this requires are a key barrier that keeps the technology out of reach for ecologists. Thus, there is a strong need to democratize access to deep learning technologies by providing an easy-to-use software application that allows non-technical users to train custom object detectors. U-Infuse addresses this issue by enabling ecologists to train customised models using publicly available images and/or their own images without specific technical expertise. Auto-annotation and annotation editing functionalities minimize the burden of manually annotating and pre-processing large numbers of images. U-Infuse is a free and open-source software solution that supports both multiclass and single-class training and object detection, allowing ecologists to access deep learning technologies usually only available to computer scientists, on their own device, customised for their application, without sharing intellectual property or sensitive data. It provides ecological practitioners with the ability to (i) easily perform object detection within a user-friendly GUI, generating a species distribution report and other useful statistics, (ii) custom train deep learning models using publicly available and custom training data, and (iii) perform supervised auto-annotation of images for further training, with the benefit of editing annotations to ensure quality datasets.
Broad adoption of U-Infuse by ecological practitioners will improve ecological image analysis and processing by allowing significantly more image data to be processed with minimal expenditure of time and resources, particularly for camera trap images. Ease of training and the use of transfer learning mean that domain-specific models can be trained rapidly and updated frequently without computer science expertise or data sharing, protecting intellectual property and privacy.

A Physics-Infused Deep Learning Model for the Prediction of Refractive Indices and Its Use for the Large-Scale Screening of Organic Compound Space

2019. DOI: 10.26434/chemrxiv.8796950

Author(s): Mojtaba Haghighatlari, Gaurav Vishwakarma, Mohammad Atif Faiz Afzal, Johannes Hachmann

Keyword(s): Machine Learning, Deep Learning, Large Scale, Organic Molecules, Learning Model, Training Data, Refractive Indices, Learning Models, Deep Learning Model, Machine Learning Models

We present a multitask, physics-infused deep learning model to accurately and efficiently predict refractive indices (RIs) of organic molecules, and we apply it to a library of 1.5 million compounds. We show that it outperforms earlier machine learning models by a significant margin, and that incorporating known physics into data-derived models provides valuable guardrails. Using a transfer learning approach, we augment the model to reproduce results consistent with higher-level computational chemistry training data, but with a considerably reduced number of corresponding calculations. Prediction errors of machine learning models are typically smallest for commonly observed target property values, consistent with the distribution of the training data. However, since our goal is to identify candidates with unusually large RI values, we propose a strategy to boost the performance of our model in the remoter areas of the RI distribution: We bias the model with respect to the under-represented classes of molecules that have values in the high-RI regime. By adopting a metric popular in web search engines, we evaluate our effectiveness in ranking top candidates. We confirm that the models developed in this study can reliably predict the RIs of the top 1,000 compounds, and are thus able to capture their ranking. We believe that this is the first study to develop a data-derived model that ensures the reliability of RI predictions by model augmentation in the extrapolation region on such a large scale. These results underscore the tremendous potential of machine learning in facilitating molecular (hyper)screening approaches on a massive scale and in accelerating the discovery of new compounds and materials, such as organic molecules with high-RI for applications in opto-electronics.
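
The abstract does not name the web-search metric it adopts. As an illustration only, the sketch below uses normalized discounted cumulative gain (NDCG), a widely used ranking metric in information retrieval; treating the true RI value directly as the relevance gain is a further assumption, and the function names are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(true_values, predicted_values, k):
    """NDCG@k: rank items by prediction, score them with the true values,
    and normalize by the score of the ideal (true-value) ranking."""
    order = sorted(range(len(true_values)),
                   key=lambda i: predicted_values[i], reverse=True)
    ranked = [true_values[i] for i in order[:k]]
    ideal = sorted(true_values, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy refractive indices (invented numbers, for illustration only)
true_ri = [1.5, 1.7, 1.9, 1.6]
good_model = [1.4, 1.75, 1.95, 1.55]  # preserves the true ordering
poor_model = [1.9, 1.7, 1.5, 1.8]     # ranks low-RI compounds first
print(ndcg_at_k(true_ri, good_model, k=2))
print(ndcg_at_k(true_ri, poor_model, k=2))
```

A metric of this kind rewards a model for placing the genuinely high-RI compounds near the top of its ranking, which matches the screening goal described in the abstract better than an average error over the whole distribution.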

Batch Effect Removal via Batch-Free Encoding

2018. DOI: 10.1101/380816. Cited By ~ 1

Author(s): Uri Shaham

Keyword(s): Deep Learning, Biological Properties, Batch Effect, Training Data, RNA-Seq, Batch Effects, Training Time, Learning Techniques, Downstream Analysis, Biological Patterns

Abstract: Biological measurements often contain systematic errors, known as “batch effects”, which may invalidate downstream analysis when not handled correctly. The problem of removing batch effects is of major importance in the biological community. Despite recent advances in this direction via deep learning techniques, most current methods may not fully preserve the true biological patterns the data contains. In this work we propose a deep learning approach for batch effect removal. The crux of our approach is learning a batch-free encoding of the data that represents its intrinsic biological properties but not batch effects. In addition, we encode the systematic factors through a decoding mechanism and require accurate reconstruction of the data. Altogether, this allows us to fully preserve the true biological patterns represented in the data. Experimental results are reported on data obtained from two high-throughput technologies, mass cytometry and single-cell RNA-seq. Beyond good performance on training data, we also observe that our system performs well on test data obtained from new patients, which was not available at training time. Our method is easy to use, and code is publicly available at https://github.com/ushaham/BatchEffectRemoval2018.

SeqEnhDL: sequence-based classification of cell type-specific enhancers using deep learning models

2020. DOI: 10.1101/2020.05.13.093997

Author(s): Yupeng Wang, Rosario B. Jaime-Lara, Abhrarup Roy, Ying Sun, Xinyue Liu, ...

Keyword(s): Neural Network, Deep Learning, DNA Sequences, Cell Types, Learning Models, Cell Type, Coding Sequences, Sequence Features, Cell Type Specific, Different Cell Types

Abstract: We propose SeqEnhDL, a deep learning framework for classifying cell type-specific enhancers based on sequence features. DNA sequences of “strong enhancer” chromatin states in nine cell types from the ENCODE project were retrieved to build and test enhancer classifiers. For any DNA sequence, sequential k-mer (k=5, 7, 9 and 11) fold changes relative to randomly selected non-coding sequences were used as features for deep learning models. Three deep learning models were implemented: a multi-layer perceptron (MLP), a convolutional neural network (CNN) and a recurrent neural network (RNN). All models in SeqEnhDL outperform state-of-the-art enhancer classifiers, including gkm-SVM and DanQ, in distinguishing cell type-specific enhancers from randomly selected non-coding sequences. Moreover, SeqEnhDL is able to directly discriminate enhancers from different cell types, which has not been achieved by other enhancer classifiers. Our analysis suggests that both enhancers and their tissue specificity can be accurately identified from their sequence features. SeqEnhDL is publicly available at https://github.com/wyp1125/SeqEnhDL.
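
The k-mer fold-change features described above can be sketched as follows. This is one plausible reading of the abstract, not the SeqEnhDL code: the function names, the pseudocount smoothing, and the neutral value for unseen k-mers are assumptions introduced for illustration.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def fold_changes(enhancers, background, k, pseudocount=1.0):
    """Fold change of each k-mer's frequency in enhancers relative to background.

    The pseudocount (an assumption, not from the abstract) avoids division by
    zero; 4 ** k is the number of possible DNA k-mers.
    """
    fg, bg = kmer_counts(enhancers, k), kmer_counts(background, k)
    fg_total = sum(fg.values()) + pseudocount * 4 ** k
    bg_total = sum(bg.values()) + pseudocount * 4 ** k
    return {
        kmer: ((fg[kmer] + pseudocount) / fg_total)
        / ((bg[kmer] + pseudocount) / bg_total)
        for kmer in set(fg) | set(bg)
    }


def sequence_features(seq, table, k):
    """Fold change of each successive k-mer along the sequence.

    k-mers absent from the table are treated as neutral (1.0), a simplification.
    """
    return [table.get(seq[i:i + k], 1.0) for i in range(len(seq) - k + 1)]


# Toy example with k=2; the abstract uses k = 5, 7, 9 and 11
table = fold_changes(["ACGTACGT"], ["AAAAAAAA"], k=2)
print(sequence_features("ACAA", table, k=2))
```

The resulting per-position vector preserves the order of k-mers along the sequence, which is what would let a CNN or RNN exploit positional patterns rather than only overall k-mer composition.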

Using Syntactic Similarity to Shorten the Training Time of Deep Learning Models using Time Series Datasets: A Case Study

2021. DOI: 10.5220/0010515700002996

Author(s): Silvestre Malta, Pedro Pinto, Manuel Veiga

Keyword(s): Time Series, Deep Learning, Learning Models, Training Time, Syntactic Similarity

RECONSTRUCTED TEETH IMAGE FROM BRACES WITH GAN

Biomedical Engineering Applications Basis and Communications, 2021, pp. 2150043. DOI: 10.4015/s1016237221500435

Author(s): Vu Tuan Hai, Dang Thanh Vu, Huynh Ho Thi Mong Trinh, Pham The Bao

Keyword(s): Health Care, Deep Learning, Training Data, Learning Models, Paired Data, Object Removal, Interactive Tool, Learning Techniques, Large Context, Image Translation

Recent advances in deep learning models have shown promising potential in object removal, the task of replacing undesired objects with appropriate pixel values using the known context. Deep learning-based object removal is commonly modeled as image-to-image (Img2Img) translation or inpainting. Rather than dealing with a large context, this paper targets a specific application of object removal: erasing braces from an image of teeth with braces (the braces2teeth problem). We solve the problem with three methods corresponding to different datasets. First, we use the CycleGAN model for the case in which paired training data is not available. Second, we create pseudo-paired data to train the Pix2Pix model. Third, we combine GraphCut with a generative inpainting model to build a user-interactive tool that can improve the result when the user is not satisfied with the previous results. To the best of our knowledge, this study is one of the first attempts to address the braces2teeth problem using deep learning techniques, and it can be applied in various fields, from health care to entertainment.

Leveraging Natural Language Processing Applications Using Machine Learning

Handbook of Research on Emerging Trends and Applications of Machine Learning - Advances in Computational Intelligence and Robotics, 2020, pp. 338-360. DOI: 10.4018/978-1-5225-9643-1.ch016

Author(s): Janjanam Prabhudas, C. H. Pradeep Reddy

Keyword(s): Machine Learning, Deep Learning, Natural Language Processing, Natural Language, Language Processing, Text Summarization, Feature Representation, Learning Models, Primary Focus, And Performance

The enormous increase in information, together with the computational abilities of machines, has created innovative applications in natural language processing through machine learning models. This chapter presents trends in natural language processing that employ machine learning and its models in the context of text summarization. The chapter is organized to help researchers understand the technical perspectives on feature representation and the associated models to consider before applying them to language-oriented tasks. Further, the chapter reviews the primary models of deep learning, their applications, and their performance in the context of language processing. The primary focus of the chapter is to illustrate the technical research findings and gaps in text summarization (TS) based on deep learning, along with state-of-the-art deep learning models for TS.

The Effects of Feature Optimization on High-Dimensional Essay Data

Mathematical Problems in Engineering, 2015, Vol 2015, pp. 1-12. DOI: 10.1155/2015/421642

Author(s): Bong-Jun Yi, Do-Gil Lee, Hae-Chang Rim

Keyword(s): Poor Performance, Feature Space, Optimization Techniques, Training Data, High Dimensional, Automated Essay Scoring, Training Time, Training And Performance Improvement, Feature Optimization, And Performance

Current machine learning (ML) based automated essay scoring (AES) systems employ a large and varied set of features that have proven useful in improving AES performance. However, the high-dimensional feature space is not properly represented, because a large volume of features is extracted from limited training data. This leads to poor performance and increased training time for the system. In this paper, we experiment with and analyze the effects of feature optimization, including normalization, discretization, and feature selection techniques, for different ML algorithms, taking into consideration the size of the feature space and the performance of the AES. We show that appropriate feature optimization techniques can reduce the dimensionality of the features, thus contributing to efficient training and improved AES performance.
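
The three feature optimization steps named in the abstract can be sketched in a few lines. The abstract does not say which specific techniques the authors used, so the choices below (min-max normalization, equal-width binning, and a variance threshold for selection) and all function names are assumptions chosen as common, simple representatives of each step.

```python
def min_max_normalize(column):
    """Rescale a feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant feature carries no information
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def discretize(column, bins):
    """Equal-width binning of values in [0, 1] into bin indices 0..bins-1."""
    return [min(int(x * bins), bins - 1) for x in column]


def select_by_variance(columns, threshold):
    """Indices of feature columns whose (population) variance exceeds threshold."""
    keep = []
    for idx, col in enumerate(columns):
        mean = sum(col) / len(col)
        variance = sum((x - mean) ** 2 for x in col) / len(col)
        if variance > threshold:
            keep.append(idx)
    return keep


# Toy essay features (invented): word count, a constant flag, a ratio
raw = [[100, 250, 400], [3, 3, 3], [0.1, 0.9, 0.5]]
normalized = [min_max_normalize(col) for col in raw]
print(discretize(normalized[0], bins=4))
print(select_by_variance(normalized, threshold=0.01))
```

In this toy run the constant column is dropped by the selection step, which is the kind of dimensionality reduction the abstract credits with shortening training time.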
