Bias-invariant RNA-sequencing metadata annotation

GigaScience ◽  
2021 ◽  
Vol 10 (9) ◽  
Author(s):  
Hannes Wartmann ◽  
Sven Heins ◽  
Karin Kloiber ◽  
Stefan Bonn

Abstract Background Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by experimental bias and a lack of annotation depth and consistency. Missing annotations make it impossible for researchers to find datasets specific to their needs. Findings Here, we investigate RNA-sequencing metadata prediction based on gene expression values. We present a deep-learning–based domain adaptation algorithm for the automatic annotation of RNA-sequencing metadata. We show, in multiple experiments, that our model integrates heterogeneous training data better than existing linear regression–based approaches, resulting in improved tissue type classification. By using a model architecture similar to Siamese networks, the algorithm can learn biases from datasets with few samples. Conclusion Using our novel domain adaptation approach, we achieved metadata annotation accuracies up to 15.7% better than a previously published method. Using the best model, we provide a list of >10,000 novel tissue and sex label annotations for 8,495 unique SRA samples. Our approach has the potential to revive idle datasets through automated annotation, making them more searchable.

2020 ◽  
Author(s):  
Hannes Wartmann ◽  
Sven Heins ◽  
Karin Kloiber ◽  
Stefan Bonn

Abstract Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by experimental bias and a lack of annotation depth and consistency. Here we investigate RNA-seq metadata prediction based on gene expression values. We present a deep-learning-based domain adaptation algorithm for the automatic annotation of RNA-seq metadata. We show how our algorithm outperforms existing approaches as well as traditional deep learning methods for the prediction of tissue, sample source, and patient sex information across several large data repositories. By using a model architecture similar to Siamese networks, the algorithm is able to learn biases from datasets with few samples. Our domain adaptation approach achieves metadata annotation accuracies up to 12.3% better than a previously published method. Lastly, we provide a list of more than 10,000 novel tissue and sex label annotations for 8,495 unique SRA samples.


2021 ◽  
Author(s):  
Talal Ahmed ◽  
Mark A Carty ◽  
Stephane Wenric ◽  
Jonathan R Dry ◽  
Ameen Abdulla Salahudeen ◽  
...  

Reproducibility of results obtained using RNA data across labs remains a major hurdle in cancer research. Often, molecular predictors trained on one dataset cannot be applied to another due to differences in RNA library preparation and quantification. While current RNA correction algorithms may overcome these differences, they require access to patient-level data, which carries an inherent risk of loss of privacy. Here, we describe SpinAdapt, a novel unsupervised domain adaptation algorithm that enables the transfer of molecular models across laboratories without access to patient-level sequencing data, thereby minimizing privacy risk. SpinAdapt computes data corrections via aggregate statistics of each dataset, rather than requiring full sample-level data access, thereby maintaining patient data privacy. Furthermore, decoupling the model from its training data allows the correction of new streaming prospective data, enabling model evaluation on validation cohorts. SpinAdapt outperforms current correction methods that require patient-level data access. We expect this novel correction paradigm to enhance research reproducibility, quality, and patient privacy.
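
The idea of correcting data from aggregate statistics alone can be illustrated with a much simpler stand-in technique. The sketch below is per-gene moment matching, not the SpinAdapt algorithm itself: one site shares only per-gene means and standard deviations, and the other site rescales its own expression matrix to match them. The function names `summarize` and `match_moments` and the toy matrix are assumptions made for illustration.

```python
import statistics


def summarize(matrix):
    """Return per-gene (column) means and population standard deviations.

    In this sketch these aggregates are the only values that leave a site.
    """
    columns = list(zip(*matrix))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def match_moments(matrix, reference_stats):
    """Rescale each gene so its mean and std match the reference aggregates."""
    ref_means, ref_stds = reference_stats
    means, stds = summarize(matrix)
    corrected = []
    for row in matrix:
        new_row = []
        for x, m, s, rm, rs in zip(row, means, stds, ref_means, ref_stds):
            # Standardize locally, then map onto the reference scale.
            z = (x - m) / s if s > 0 else 0.0
            new_row.append(z * rs + rm)
        corrected.append(new_row)
    return corrected


# Toy data: rows are samples, columns are genes (hypothetical values).
local_site = [[1.0, 10.0], [3.0, 30.0]]
reference_stats = ([0.0, 0.0], [1.0, 1.0])  # shared aggregates only
corrected = match_moments(local_site, reference_stats)
```

No sample-level values from the reference site are needed at any point, which is the property the abstract emphasizes.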


2021 ◽  
Vol 7 (2) ◽  
pp. 31
Author(s):  
Penghao Zhang ◽  
Jiayue Li ◽  
Yining Wang ◽  
Judong Pan

Convolutional neural networks (CNNs) have demonstrated great success in increasing the accuracy and stability of medical image segmentation. However, existing CNNs are limited by their dependence on the availability of training data, owing to high manual annotation costs and privacy issues. To counter this limitation, domain adaptation (DA) and few-shot learning have been extensively studied. Inspired by these two categories of approaches, we propose an optimization-based meta-learning method for segmentation tasks. Even though existing meta-learning methods use prior knowledge to choose parameters that generalize well from few examples, these methods limit the diversity of the task distribution that they can learn from in medical image segmentation. In this paper, we propose a meta-learning algorithm to augment the existing algorithms with the capability to learn from diverse segmentation tasks across the entire task distribution. Specifically, our algorithm aims to learn from the diversity of image features which characterize a specific tissue type while showing diverse signal intensities. To demonstrate the effectiveness of the proposed algorithm, we conducted experiments using a diverse set of segmentation tasks from the Medical Segmentation Decathlon and two meta-learning benchmarks: model-agnostic meta-learning (MAML) and Reptile. U-Net and Dice similarity coefficient (DSC) were selected as the baseline model and the main performance metric, respectively. The experimental results show that our algorithm surpasses MAML and Reptile by up to 2% and 2.4%, respectively, in terms of the DSC. By showing a consistent improvement in subjective measures, we can also infer that our algorithm can produce a better generalization of a target task that has few examples.
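
For readers unfamiliar with the metric named above, the Dice similarity coefficient between a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch for flat binary masks; the function name `dice_coefficient`, the toy masks, and the empty-mask convention are assumptions, not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy masks: 3 foreground pixels each, 2 of them shared.
pred_mask = [0, 1, 1, 1, 0, 0]
true_mask = [0, 1, 1, 0, 0, 1]
dsc = dice_coefficient(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```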


Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1573
Author(s):  
Loris Nanni ◽  
Giovanni Minchio ◽  
Sheryl Brahnam ◽  
Gianluca Maguolo ◽  
Alessandra Lumini

Traditionally, classifiers are trained to predict patterns within a feature space. The image classification system presented here trains classifiers to predict patterns within a vector space by combining the dissimilarity spaces generated by a large set of Siamese Neural Networks (SNNs). A set of centroids from the patterns in the training data sets is calculated with supervised k-means clustering. The centroids are used to generate the dissimilarity space via the Siamese networks. The vector space descriptors are extracted by projecting patterns onto the similarity spaces, and SVMs classify an image by its dissimilarity vector. The versatility of the proposed approach in image classification is demonstrated by evaluating the system on different types of images across two domains: two medical data sets and two animal audio data sets with vocalizations represented as images (spectrograms). Results show that the proposed system competes well with the best-performing methods in the literature, obtaining state-of-the-art performance on one of the medical data sets, and does so without ad-hoc optimization of the clustering methods on the tested data sets.
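
The projection step described above can be sketched as follows: each pattern is represented by its vector of dissimilarities to a fixed set of centroids. In the paper the dissimilarity is learned by Siamese networks and the resulting vectors are classified by SVMs; this sketch substitutes plain Euclidean distance and omits the classifier. The names `euclidean` and `dissimilarity_vector` and the toy centroids are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance, used here as a stand-in for a learned Siamese dissimilarity."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_vector(pattern, centroids, distance=euclidean):
    """Describe a pattern by its dissimilarity to every centroid."""
    return [distance(pattern, c) for c in centroids]


# Toy centroids, e.g. as produced by a clustering step (hypothetical values).
centroids = [(0.0, 0.0), (3.0, 4.0)]
descriptor = dissimilarity_vector((3.0, 0.0), centroids)
```

The descriptor has one entry per centroid, so its length is set by the clustering step rather than by the original feature dimension.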


2020 ◽  
Vol 13 (1) ◽  
pp. 23
Author(s):  
Wei Zhao ◽  
William Yamada ◽  
Tianxin Li ◽  
Matthew Digman ◽  
Troy Runge

In recent years, precision agriculture has been researched as a promising means of increasing crop production with fewer inputs to meet the growing demand for agricultural products. Computer vision-based crop detection with unmanned aerial vehicle (UAV)-acquired images is a critical tool for precision agriculture. However, object detection using deep learning algorithms relies on a significant amount of manually prelabeled training data as ground truth. Field object detection, such as bale detection, is especially difficult because of (1) long-period image acquisitions under different illumination conditions and seasons; (2) limited existing prelabeled data; and (3) few pretrained models and research as references. This work increases the bale detection accuracy based on limited data collection and labeling by building an innovative algorithm pipeline. First, an object detection model is trained using 243 images captured under good illumination conditions in fall from the crop lands. In addition, domain adaptation (DA), a kind of transfer learning, is applied for synthesizing the training data under diverse environmental conditions with automatic labels. Finally, the object detection model is optimized with the synthesized datasets. The case study shows the proposed method improves the bale detecting performance, including the recall, mean average precision (mAP), and F measure (F1 score), from averages of 0.59, 0.7, and 0.7 (the object detection) to averages of 0.93, 0.94, and 0.89 (the object detection + DA), respectively. This approach could be easily scaled to many other crop field objects and will significantly contribute to precision agriculture.
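
Two of the reported metrics, recall and F1 score, follow directly from detection counts: recall = TP / (TP + FN) and F1 = 2TP / (2TP + FP + FN). The sketch below computes them from hypothetical counts; the function name `detection_scores` and the numbers are assumptions, and mAP is not covered because it additionally requires ranked confidence scores.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, f1) from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall, written in terms of counts.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 bales detected correctly, 2 false alarms, 2 bales missed.
precision, recall, f1 = detection_scores(tp=8, fp=2, fn=2)
```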


Author(s):  
Jianqun Zhang ◽  
Qing Zhang ◽  
Xianrong Qin ◽  
Yuantao Sun

To identify rolling bearing faults under variable load conditions, a method named DISA-KNN is proposed in this paper, which is based on the strategy of feature extraction, domain adaptation, and classification. Specifically, time-domain and frequency-domain indicators are used for feature extraction. Discriminative and domain invariant subspace alignment (DISA) is used to minimize the discrepancy between the data distributions of the training data (source domain) and testing data (target domain). K-nearest neighbor (KNN) is applied to identify rolling bearing faults. DISA-KNN is validated on experimental signals collected under different load conditions. The identification accuracies obtained by the DISA-KNN method are more than 90% on four datasets, including one dataset with 99.5% accuracy. The strength of the proposed method is further highlighted by comparisons with 8 other methods. These results reveal that the proposed method is promising for rolling bearing fault diagnosis in real rotating machinery.


2018 ◽  
Vol 35 (15) ◽  
pp. 2535-2544 ◽  
Author(s):  
Dipan Shaw ◽  
Hao Chen ◽  
Tao Jiang

Abstract Motivation Isoforms are mRNAs produced from the same gene locus by alternative splicing and may have different functions. Although gene functions have been studied extensively, little is known about the specific functions of isoforms. Recently, some computational approaches based on multiple instance learning have been proposed to predict isoform functions from annotated gene functions and expression data, but their performance is far from being desirable primarily due to the lack of labeled training data. To improve the performance on this problem, we propose a novel deep learning method, DeepIsoFun, that combines multiple instance learning with domain adaptation. The latter technique helps to transfer the knowledge of gene functions to the prediction of isoform functions and provides additional labeled training data. Our model is trained on a deep neural network architecture so that it can adapt to different expression distributions associated with different gene ontology terms. Results We evaluated the performance of DeepIsoFun on three expression datasets of human and mouse collected from SRA studies at different times. On each dataset, DeepIsoFun performed significantly better than the existing methods. In terms of area under the receiver operating characteristics curve, our method acquired at least 26% improvement and in terms of area under the precision-recall curve, it acquired at least 10% improvement over the state-of-the-art methods. In addition, we also study the divergence of the functions predicted by our method for isoforms from the same gene and the overall correlation between expression similarity and the similarity of predicted functions. Availability and implementation https://github.com/dls03/DeepIsoFun/ Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
D. Gritzner ◽  
J. Ostermann

Abstract. Modern machine learning, especially deep learning, which is used in a variety of applications, requires a lot of labelled data for model training. Having an insufficient number of training examples leads to models which do not generalize well to new input instances. This is a particularly significant problem for tasks involving aerial images: often training data is only available for a limited geographical area and a narrow time window, thus leading to models which perform poorly in different regions, at different times of day, or during different seasons. Domain adaptation can mitigate this issue by using labelled source domain training examples and unlabelled target domain images to train a model which performs well on both domains. Modern adversarial domain adaptation approaches use unpaired data. We propose using pairs of semantically similar images, i.e., whose segmentations are accurate predictions of each other, for improved model performance. In this paper we show that, as an upper limit based on ground truth, using semantically paired aerial images during training almost always increases model performance with an average improvement of 4.2% accuracy and .036 mean intersection-over-union (mIoU). Using a practical estimate of semantic similarity, we still achieve improvements in more than half of all cases, with average improvements of 2.5% accuracy and .017 mIoU in those cases.
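
The two metrics reported above can be computed from per-pixel class labels: accuracy is the fraction of matching pixels, and mIoU averages, over classes, the ratio of intersection to union between predicted and reference pixels of that class. The sketch below is a minimal illustration; the function names, the toy label maps, and the choice to skip classes absent from both maps are assumptions.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class equals the reference class."""
    return sum(1 for p, t in zip(pred, target) if p == t) / len(target)


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label sequences."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        # Assumed convention: classes absent from both maps are skipped.
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy label maps with three classes (hypothetical values).
pred_labels = [0, 0, 1, 1, 2, 2]
true_labels = [0, 1, 1, 1, 2, 0]
acc = pixel_accuracy(pred_labels, true_labels)
miou = mean_iou(pred_labels, true_labels, num_classes=3)
```

For these toy maps the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5, while 4 of 6 pixels match.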


2020 ◽  
Vol 36 (17) ◽  
pp. 4633-4642 ◽  
Author(s):  
Karim Abbasi ◽  
Parvin Razzaghi ◽  
Antti Poso ◽  
Massoud Amanlou ◽  
Jahan B Ghasemi ◽  
...  

Abstract Motivation An essential part of drug discovery is the accurate prediction of the binding affinity of new compound–protein pairs. Most of the standard computational methods assume that compounds or proteins of the test data are observed during the training phase. However, in real-world situations, the test and training data are sampled from different domains with different distributions. To cope with this challenge, we propose a deep learning-based approach that consists of three steps. In the first step, the training encoder network learns a novel representation of compounds and proteins. To this end, we combine convolutional layers and long short-term memory layers so that the occurrence patterns of local substructures through a protein and a compound sequence are learned. Also, to encode the interaction strength of the protein and compound substructures, we propose a two-sided attention mechanism. In the second step, to deal with the different distributions of the training and test domains, a feature encoder network is learned for the test domain by utilizing an adversarial domain adaptation approach. In the third step, the learned test encoder network is applied to new compound–protein pairs to predict their binding affinity. Results To evaluate the proposed approach, we applied it to KIBA, Davis and BindingDB datasets. The results show that the proposed method learns a more reliable model for the test domain in more challenging situations. Availability and implementation https://github.com/LBBSoft/DeepCDA.

