Predicting Epigenomic Functions of Genetic Variants in the Context of Neurodevelopment via Deep Transfer Learning

2021 ◽  
Author(s):  
Boqiao Lai ◽  
Sheng Qian ◽  
Hanwen Zhang ◽  
Siwei Zhang ◽  
Alena Kozlova ◽  
...  

Abstract Decoding the regulatory effects of non-coding variants is a key challenge in understanding the mechanisms of gene regulation as well as the genetics of common diseases. Recently, deep learning models have been introduced to predict genome-wide epigenomic profiles and the effects of DNA variants in various cellular contexts, but they were often trained on cell lines or bulk tissues that may not be related to the phenotypes of interest. This is a particular challenge for neuropsychiatric disorders, since the most relevant cell and tissue types are often missing from the training data of such models. To address this issue, we introduce a deep transfer learning framework termed MetaChrom that takes advantage of both a reference dataset (an extensive compendium of publicly available epigenomic data) and epigenomic profiles of cell types related to specific phenotypes of interest. We trained and evaluated our model on a comprehensive set of epigenomic profiles from fetal and adult brain, and cellular models representing early neurodevelopment. MetaChrom predicts these epigenomic features with much higher accuracy than previous methods and than models trained without reference epigenomic data for transfer learning. Using experimentally determined regulatory variants from iPS cell-derived neurons, we show that MetaChrom predicts functional variants more accurately than existing non-coding variant scoring tools. By combining genome-wide association study (GWAS) data with MetaChrom predictions, we prioritized 31 SNPs for schizophrenia (SCZ). These candidate SNPs suggest potential risk genes of SCZ and the biological contexts where they act. In summary, MetaChrom is a general transfer learning framework that can be applied to the study of regulatory functions of DNA sequences and variants in any disease-related cell or tissue types. 
The software tool is available at https://github.com/bl-2633/MetaChrom and a prediction web server is accessible at https://metachrom.ttic.edu/.

2018 ◽  
Vol 2018 ◽  
pp. 1-12 ◽  
Author(s):  
Ibrahim Hossain ◽  
Abbas Khosravi ◽  
Imali Hettiarachchi ◽  
Saeid Nahavandi

A widely discussed paradigm for brain-computer interfaces (BCI) is the motor imagery task using the noninvasive electroencephalography (EEG) modality. It often requires a long training session to collect a large amount of EEG data, which exhausts the user. One approach to shortening this session is to use instances from past users to train the learner for a new user. In this work, direct transfer from past users is investigated and applied to multiclass motor imagery BCI. Then, active learning (AL) driven informative instance transfer learning is attempted for multiclass BCI. Informative instance transfer performs better than direct instance transfer, reaching the benchmark with a reduced amount of training data (49% less) for 6 out of 9 subjects. However, neither method is superior for all subjects in general. To obtain a generic transfer learning framework for BCI, an optimal ensemble of the informative and direct transfer methods is designed and applied. The optimized ensemble outperforms both the direct and the informative transfer method for all subjects except one in the BCI Competition IV multiclass motor imagery dataset. It achieves the benchmark performance for 8 out of 9 subjects using, on average, 75% less training data. Thus, the amount of training data required from a new user is reduced significantly.
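
The abstract above does not say which criterion makes a past-user instance "informative". As a minimal sketch, assuming predictive entropy is used as the uncertainty measure (a common active learning choice, not necessarily the authors'), the selection step might look like this. The function names, trial identifiers, and probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_informative(candidates, k):
    """Return the ids of the k candidates the current model is least sure about.

    candidates: list of (instance_id, class_probabilities) pairs, where the
    probabilities come from a model trained on the new user's data so far.
    """
    ranked = sorted(candidates, key=lambda c: entropy(c[1]), reverse=True)
    return [instance_id for instance_id, _ in ranked[:k]]


# Hypothetical 4-class motor imagery trials recorded from past users.
pool = [
    ("trial_a", [0.97, 0.01, 0.01, 0.01]),
    ("trial_b", [0.25, 0.25, 0.25, 0.25]),
    ("trial_c", [0.60, 0.30, 0.05, 0.05]),
    ("trial_d", [0.40, 0.40, 0.10, 0.10]),
]
chosen = select_informative(pool, 2)
print(chosen)
```

Direct transfer, by contrast, would add every past-user instance to the training set without ranking.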


2021 ◽  
Vol 4 ◽  
Author(s):  
Ruqian Hao ◽  
Khashayar Namdar ◽  
Lin Liu ◽  
Farzad Khalvati

Brain tumor is one of the leading causes of cancer-related death globally among children and adults. Precise classification of brain tumor grade (low-grade and high-grade glioma) at an early stage plays a key role in successful prognosis and treatment planning. With recent advances in deep learning, artificial intelligence–enabled brain tumor grading systems can assist radiologists in the interpretation of medical images within seconds. The performance of deep learning techniques is, however, highly dependent on the size of the annotated dataset. It is extremely challenging to label a large quantity of medical images, given the complexity and volume of medical data. In this work, we propose a novel transfer learning–based active learning framework to reduce the annotation cost while maintaining stability and robustness of the model performance for brain tumor classification. In this retrospective research, we employed a 2D slice–based approach to train and fine-tune our model on the magnetic resonance imaging (MRI) training dataset of 203 patients and a validation dataset of 66 patients which was used as the baseline. With our proposed method, the model achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 82.89% on a separate test dataset of 66 patients, which was 2.92% higher than the baseline AUC while saving at least 40% of labeling cost. To further examine the robustness of our method, we created a balanced dataset, which underwent the same procedure. The model achieved an AUC of 82% compared with an AUC of 78.48% for the baseline, which supports the robustness and stability of our proposed transfer learning augmented with active learning framework while significantly reducing the size of training data.


2022 ◽  
Vol 3 ◽  
Author(s):  
Yi Chang ◽  
Xin Jing ◽  
Zhao Ren ◽  
Björn W. Schuller

Since the COronaVIrus Disease 2019 (COVID-19) outbreak, developing a digital diagnostic tool to detect COVID-19 from respiratory sounds with computer audition has become an essential topic due to its advantages of being swift, low-cost, and eco-friendly. However, prior studies mainly focused on small-scale COVID-19 datasets. To build a robust model, the large-scale multi-sound FluSense dataset is utilised to help detect COVID-19 from cough sounds in this study. Due to the gap between FluSense and the COVID-19-related datasets consisting of cough only, a transfer learning framework (namely CovNet) is proposed and applied rather than simply augmenting the training data with FluSense. CovNet contains (i) a parameter transferring strategy and (ii) an embedding incorporation strategy. Specifically, to validate CovNet's effectiveness, it is used to transfer knowledge from FluSense to COUGHVID, a large-scale cough sound database of COVID-19 negative and COVID-19 positive individuals. The model trained on FluSense and COUGHVID is further applied under CovNet to another two small-scale cough datasets for COVID-19 detection, the COVID-19 cough sub-challenge (CCS) database of the INTERSPEECH Computational Paralinguistics challengE (ComParE) and the DiCOVA Track-1 database. By training four simple convolutional neural networks (CNNs) in the transfer learning framework, our approach achieves an absolute improvement of 3.57% over the DiCOVA Track-1 baseline in validation area under the receiver operating characteristic curve (ROC AUC) and an absolute improvement of 1.73% over the ComParE CCS baseline in test unweighted average recall (UAR).
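
Unweighted average recall (UAR), the metric reported for ComParE CCS above, is the mean of the per-class recalls, so a rare positive class counts as much as a frequent negative class. The following is a minimal sketch of that definition; the labels are made up for illustration and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, with every class weighted equally."""
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced labels: 1 = COVID-19 positive, 0 = negative.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

In this toy case the accuracy (0.75) is higher than the UAR (about 0.667) because half of the positive class is missed, which is why UAR is often preferred for imbalanced detection tasks.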


2018 ◽  
Author(s):  
Akosua Busia ◽  
George E. Dahl ◽  
Clara Fannjiang ◽  
David H. Alexander ◽  
Elizabeth Dorfman ◽  
...  

Abstract Motivation Inferring properties of biological sequences, such as determining the species-of-origin of a DNA sequence or the function of an amino-acid sequence, is a core task in many bioinformatics applications. These tasks are often solved using string-matching to map query sequences to labeled database sequences or via Hidden Markov Model-like pattern matching. In the current work we describe and assess a deep learning approach which trains a deep neural network (DNN) to predict database-derived labels directly from query sequences. Results We demonstrate this DNN performs at state-of-the-art or above levels on a difficult, practically important problem: predicting species-of-origin from short reads of 16S ribosomal DNA. When trained on 16S sequences of over 13,000 distinct species, our DNN achieves read-level species classification accuracy within 2.0% of perfect memorization of training data, and produces more accurate genus-level assignments for reads from held-out species than k-mer, alignment, and taxonomic binning baselines. Moreover, our models exhibit greater robustness than these existing approaches to increasing noise in the query sequences. Finally, we show that these DNNs perform well on an experimental 16S mock community dataset. Overall, our results constitute a first step towards our long-term goal of developing a general-purpose deep learning approach to predicting meaningful labels from short biological sequences. Availability TensorFlow training code is available through GitHub (https://github.com/tensorflow/models/tree/master/research). Data in TensorFlow TFRecord format is available on Google Cloud Storage (gs://brain-genomics-public/research/seq2species/). Contact [email protected] Supplementary information Supplementary data are available in a separate document.


2020 ◽  
Author(s):  
Iason Katsamenis ◽  
Eftychios Protopapadakis ◽  
Athanasios Voulodimos ◽  
Anastasios Doulamis ◽  
Nikolaos Doulamis

We introduce a deep learning framework that can detect COVID-19 pneumonia in thoracic radiographs, as well as differentiate it from bacterial pneumonia infection. Deep classification models, such as convolutional neural networks (CNNs), require large-scale datasets in order to be trained and perform properly. Since the number of X-ray samples related to COVID-19 is limited, transfer learning (TL) appears as the go-to method to alleviate the demand for training data and develop accurate automated diagnosis models. In this context, networks are able to gain knowledge from networks pretrained on large-scale image datasets or alternative data-rich sources (i.e., bacterial and viral pneumonia radiographs). The experimental results indicate that the TL approach outperforms training without TL on the COVID-19 classification task in chest X-ray images.


2018 ◽  
Author(s):  
Andrew P. Anderson ◽  
Adam G. Jones

Abstract Motivation Estrogen response elements (EREs) are specific DNA sequences to which ligand-bound estrogen receptors (ERs) physically bind, allowing them to act as transcription factors for target genes. Locating EREs and ER responsive regions is therefore a potentially important component of the study of estrogen-regulated pathways. Results We tested and demonstrated the ability of EREFinder, a novel algorithm we developed, to locate regions of ER-binding across the human genome and showed that the regions designated by the program occur more frequently near estrogen responsive genes. EREFinder can handle large input files, has settings to allow for broad and narrow searches, and provides the full output to allow for greater data manipulation. These features facilitate a wide range of hypothesis testing for researchers and make EREFinder an excellent tool to aid in estrogen-related research. Availability and Implementation Source code and binaries are freely available for download at https://github.com/JonesLabIdaho/EREfinder, implemented in C++ and supported on Linux and MS [email protected] Supplementary Materials R scripts can be found at https://github.com/JonesLabIdaho/EREfinder


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Justin Y. Lee ◽  
Britney Nguyen ◽  
Carlos Orosco ◽  
Mark P. Styczynski

Abstract Background The topology of metabolic networks is both well-studied and remarkably well-conserved across many species. The regulation of these networks, however, is much more poorly characterized, though it is known to be divergent across organisms—two characteristics that make it difficult to model metabolic networks accurately. While many computational methods have been built to unravel transcriptional regulation, there have been few approaches developed for systems-scale analysis and study of metabolic regulation. Here, we present a stepwise machine learning framework that applies established algorithms to identify regulatory interactions in metabolic systems based on metabolic data: stepwise classification of unknown regulation, or SCOUR. Results We evaluated our framework on both noiseless and noisy data, using several models of varying sizes and topologies to show that our approach is generalizable. We found that, when testing on data under the most realistic conditions (low sampling frequency and high noise), SCOUR could identify reaction fluxes controlled only by the concentration of a single metabolite (its primary substrate) with high accuracy. The positive predictive value (PPV) for identifying reactions controlled by the concentration of two metabolites ranged from 32 to 88% for noiseless data, 9.2 to 49% for either low sampling frequency/low noise or high sampling frequency/high noise data, and 6.6–27% for low sampling frequency/high noise data, with results typically sufficiently high for lab validation to be a practical endeavor. While the PPVs for reactions controlled by three metabolites were lower, they were still in most cases significantly better than random classification. Conclusions SCOUR uses a novel approach to synthetically generate the training data needed to identify regulators of reaction fluxes in a given metabolic system, enabling metabolomics and fluxomics data to be leveraged for regulatory structure inference. 
By identifying and triaging the most likely candidate regulatory interactions, SCOUR can drastically reduce the amount of time needed to identify and experimentally validate metabolic regulatory interactions. As high-throughput experimental methods for testing these interactions are further developed, SCOUR will provide critical impact in the development of predictive metabolic models in new organisms and pathways.
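
The positive predictive value (PPV) quoted above is the fraction of predicted regulatory interactions that are real: PPV = TP / (TP + FP). The short sketch below only illustrates that definition; the flux and metabolite names are hypothetical and the function is not part of SCOUR.

```python
def positive_predictive_value(predicted, actual):
    """PPV = TP / (TP + FP) for sets of (flux, regulator) pairs."""
    true_positives = len(predicted & actual)
    false_positives = len(predicted - actual)
    if true_positives + false_positives == 0:
        return 0.0  # nothing was predicted, so PPV is reported as 0 here
    return true_positives / (true_positives + false_positives)


# Hypothetical interactions: (reaction flux, controlling metabolite).
predicted = {("v1", "A"), ("v1", "B"), ("v2", "C"), ("v3", "A")}
actual = {("v1", "A"), ("v2", "C"), ("v4", "D")}
ppv = positive_predictive_value(predicted, actual)
print(ppv)
```

With two of the four predictions confirmed, the PPV here is 0.5: half of the candidates sent for lab validation would be true regulatory interactions.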


Author(s):  
Yanrong Ji ◽  
Zhihan Zhou ◽  
Han Liu ◽  
Ramana V Davuluri

Abstract Motivation Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. Availability and implementation The source code and the pretrained and fine-tuned models for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information Supplementary data are available at Bioinformatics online.


Water ◽  
2021 ◽  
Vol 13 (8) ◽  
pp. 1109
Author(s):  
Nobuaki Kimura ◽  
Kei Ishida ◽  
Daichi Baba

Long-term climate change may strongly affect the aquatic environment in mid-latitude water resources. In particular, it can be demonstrated that temporal variations in surface water temperature in a reservoir respond strongly to air temperature. We adopted deep neural networks (DNNs) to understand the long-term relationships between air temperature and surface water temperature, because DNNs can easily deal with nonlinear data, including uncertainties, that are obtained in complicated climate and aquatic systems. In general, DNNs cannot appropriately predict unexperienced data (i.e., data outside the range of the training data), such as future water temperature. To address this limitation, our idea is to introduce a transfer learning (TL) approach. The observed data were used to train a DNN-based model. Continuous data (i.e., air temperature) spanning 150 years of climate change, obtained from climate models including a downscaling model, were used for pre-training and to predict past and future surface water temperatures in the reservoir. The results showed that the DNN-based model with the TL approach was able to make approximate predictions based on the difference between past and future air temperatures. The model suggested that, in the future predictions, occurrences of the highest water temperatures increased and occurrences of the lowest water temperatures decreased.


Electronics ◽  
2021 ◽  
Vol 10 (15) ◽  
pp. 1807
Author(s):  
Sascha Grollmisch ◽  
Estefanía Cano

Including unlabeled data in the training process of neural networks using Semi-Supervised Learning (SSL) has shown impressive results in the image domain, where state-of-the-art results were obtained with only a fraction of the labeled data. What recent SSL methods have in common is that they rely strongly on the augmentation of unannotated data. This is largely unexplored for audio data. In this work, SSL using the state-of-the-art FixMatch approach is evaluated on three audio classification tasks, including music, industrial sounds, and acoustic scenes. The performance of FixMatch is compared to Convolutional Neural Networks (CNNs) trained from scratch, Transfer Learning, and SSL using the Mean Teacher approach. Additionally, a simple yet effective approach for selecting suitable augmentation methods for FixMatch is introduced. FixMatch with the proposed modifications always outperformed Mean Teacher and the CNNs trained from scratch. For the industrial sounds and music datasets, the CNN baseline performance using the full dataset was reached with less than 5% of the initial training data, demonstrating the potential of recent SSL methods for audio data. Transfer Learning outperformed FixMatch only for the most challenging dataset from acoustic scene classification, showing that there is still room for improvement.
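
To make the role of augmentation in FixMatch concrete, here is a rough sketch of its unlabeled-data loss. The model's prediction on a weakly augmented clip becomes a pseudo-label only if its confidence passes a threshold, and the prediction on a strongly augmented version of the same clip is then trained toward that pseudo-label. The probabilities below are invented, the function name is hypothetical, and the threshold of 0.95 is an assumed default rather than a value taken from this paper.

```python
import math


def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """Confidence-masked cross-entropy, averaged over the whole unlabeled batch.

    weak_probs / strong_probs: per-sample class probabilities predicted for
    the weakly and strongly augmented version of the same unlabeled sample.
    """
    total = 0.0
    for weak, strong in zip(weak_probs, strong_probs):
        confidence = max(weak)
        if confidence >= threshold:
            pseudo_label = weak.index(confidence)  # hard pseudo-label
            # Clamp to avoid log(0) if the strong view assigns zero probability.
            total += -math.log(max(strong[pseudo_label], 1e-12))
    # Samples below the threshold contribute zero but still count in the mean.
    return total / len(weak_probs)


# Hypothetical batch of two unlabeled audio clips, three classes.
weak = [[0.97, 0.02, 0.01], [0.50, 0.30, 0.20]]
strong = [[0.70, 0.20, 0.10], [0.10, 0.80, 0.10]]
loss = fixmatch_unlabeled_loss(weak, strong)
print(loss)
```

Only the first clip passes the threshold here, so the second clip is ignored for this step. This dependence on weak and strong views is why the choice of audio augmentations matters for the method.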

