Weighted elastic net for unsupervised domain adaptation with application to age prediction from DNA methylation data

2019 ◽  
Vol 35 (14) ◽  
pp. i154-i163 ◽  
Author(s):  
Lisa Handl ◽  
Adrin Jalali ◽  
Michael Scherer ◽  
Ralf Eggeling ◽  
Nico Pfeifer

Abstract Motivation Predictive models are a powerful tool for solving complex problems in computational biology. They are typically designed to predict or classify data coming from the same unknown distribution as the training data. In many real-world settings, however, uncontrolled biological or technical factors can lead to a distribution mismatch between datasets acquired at different times, causing model performance to deteriorate on new data. A common additional obstacle in computational biology is scarce data with many more features than samples. To address these problems, we propose a method for unsupervised domain adaptation that is based on a weighted elastic net. The key idea of our approach is to compare dependencies between inputs in training and test data and to increase the cost of differently behaving features in the elastic net regularization term. In doing so, we encourage the model to assign a higher importance to features that are robust and behave similarly across domains. Results We evaluate our method both on simulated data with varying degrees of distribution mismatch and on real data, considering the problem of age prediction based on DNA methylation data across multiple tissues. Compared with a non-adaptive standard model, our approach substantially reduces errors on samples with a mismatched distribution. On real data, we achieve far lower errors on cerebellum samples, a tissue which is not part of the training data and poorly predicted by standard models. Our results demonstrate that unsupervised domain adaptation is possible for applications in computational biology, even with many more features than samples. Availability and implementation Source code is available at https://github.com/PfeiferLabTue/wenda. Supplementary information Supplementary data are available at Bioinformatics online.
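The core idea in the abstract above, raising the regularization cost of features that behave differently across domains, can be illustrated with a minimal sketch. It assumes a common form of the weighted elastic net objective, (1/(2n))·||y − Xb||² + λ·Σ_j w_j·(α·|b_j| + (1 − α)/2·b_j²), solved by coordinate descent. The abstract does not give the exact penalty or the way the weights are derived from the training and test data, so the weights, the function names and the toy data below are hypothetical and this is not the authors' implementation (see the linked repository for that).

```python
def soft_threshold(value, threshold):
    """Soft-thresholding operator used in lasso-type coordinate descent."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_elastic_net(X, y, weights, lam=0.1, alpha=0.5, n_iter=200):
    """Coordinate descent for an elastic net with per-feature penalty weights.

    Assumed objective (not taken from the paper):
    (1/(2n)) * ||y - Xb||^2 + lam * sum_j w_j * (alpha*|b_j| + (1-alpha)/2 * b_j^2)
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X @ coef because coef starts at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(col[i] * (residual[i] + col[i] * coef[j]) for i in range(n)) / n
            z = sum(c * c for c in col) / n
            new = soft_threshold(rho, lam * alpha * weights[j])
            new /= z + lam * (1.0 - alpha) * weights[j]
            delta = new - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                coef[j] = new
    return coef


# Toy data: two equally informative, orthogonal features, y = x1 + x2.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0, 0.0, 0.0, -2.0]

equal = weighted_elastic_net(X, y, weights=[1.0, 1.0])
# Hypothetical weights: feature 2 is assumed to behave differently across domains.
penalized = weighted_elastic_net(X, y, weights=[1.0, 10.0])
```

With equal weights both coefficients are shrunk identically. With the larger weight on the second feature, its coefficient is shrunk much more strongly, so the fitted model relies mainly on the feature assumed to be robust.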

Author(s):  
Weiwei Zhang ◽  
Hao Wu ◽  
Ziyi Li

Abstract Motivation It is common practice in epigenetics research to profile DNA methylation on tissue samples, which are usually mixtures of different cell types. To properly account for the mixture, estimating cell compositions has been recognized as an important first step. Many methods have been developed for quantifying cell compositions from DNA methylation data, but most have limited applicability because they require reference or prior information that is often unavailable. Results We develop Tsisal, a novel complete deconvolution method that accurately estimates cell compositions from DNA methylation data without any prior knowledge of cell types or their proportions. Tsisal is a full pipeline that estimates the number of cell types and the cell compositions, and identifies cell-type-specific CpG sites. It can also assign cell type labels when a full or partial reference panel is available. Extensive simulation studies and analyses of seven real data sets demonstrate the favorable performance of our proposed method compared with existing deconvolution methods serving a similar purpose. Availability The proposed method Tsisal is implemented as part of the R/Bioconductor package TOAST at https://bioconductor.org/packages/TOAST. Contact [email protected] and [email protected]. Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 35 (15) ◽  
pp. 2535-2544 ◽  
Author(s):  
Dipan Shaw ◽  
Hao Chen ◽  
Tao Jiang

Abstract Motivation Isoforms are mRNAs produced from the same gene locus by alternative splicing and may have different functions. Although gene functions have been studied extensively, little is known about the specific functions of isoforms. Recently, some computational approaches based on multiple instance learning have been proposed to predict isoform functions from annotated gene functions and expression data, but their performance is far from being desirable primarily due to the lack of labeled training data. To improve the performance on this problem, we propose a novel deep learning method, DeepIsoFun, that combines multiple instance learning with domain adaptation. The latter technique helps to transfer the knowledge of gene functions to the prediction of isoform functions and provides additional labeled training data. Our model is trained on a deep neural network architecture so that it can adapt to different expression distributions associated with different gene ontology terms. Results We evaluated the performance of DeepIsoFun on three expression datasets of human and mouse collected from SRA studies at different times. On each dataset, DeepIsoFun performed significantly better than the existing methods. In terms of area under the receiver operating characteristics curve, our method acquired at least 26% improvement and in terms of area under the precision-recall curve, it acquired at least 10% improvement over the state-of-the-art methods. In addition, we also study the divergence of the functions predicted by our method for isoforms from the same gene and the overall correlation between expression similarity and the similarity of predicted functions. Availability and implementation https://github.com/dls03/DeepIsoFun/ Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (19) ◽  
pp. 3786-3793 ◽  
Author(s):  
Pietro Di Lena ◽  
Claudia Sala ◽  
Andrea Prodi ◽  
Christine Nardini

Abstract Motivation DNA methylation is a stable epigenetic mark with major implications in both physiological (development, aging) and pathological conditions (cancers and numerous diseases). Recent research involving methylation focuses on the development of molecular age estimation methods based on DNA methylation levels (mAge). An increasing number of studies indicate that divergences between mAge and chronological age may be associated with age-related diseases. Current advances in high-throughput technologies have allowed the characterization of DNA methylation levels throughout the human genome. However, experimental methylation profiles often contain multiple missing values that can affect the analysis of the data and also mAge estimation. Although several imputation methods exist, a major deficiency lies in the inability to cope with large datasets, such as DNA methylation chips. Specific methods for imputing missing methylation data are therefore needed. Results We present a simple and computationally efficient imputation method, methyLImp, based on linear regression. The rationale of the approach lies in the observation that methylation levels show a high degree of inter-sample correlation. We performed a comparative study of our approach with other imputation methods on DNA methylation data of healthy and disease samples from different tissues. Performance was assessed both in terms of imputation accuracy and in terms of the impact imputed values have on mAge estimation. In comparison to existing methods, our linear regression model performs equally well or better, with good computational efficiency. The results of our analysis provide recommendations for accurate estimation of missing methylation values. Availability and implementation The R-package methyLImp is freely available at https://github.com/pdilena/methyLImp. Supplementary information Supplementary data are available at Bioinformatics online.
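To make the idea of regression-based imputation concrete, the following toy sketch fills a missing beta value at one CpG site from a single, fully observed predictor site using ordinary least squares. The abstract only states that the method is based on linear regression, so this univariate version, the function names and the example values are assumptions for illustration and do not reproduce the methyLImp algorithm.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_column(target, predictor):
    """Fill None entries of `target` from a fully observed `predictor` column."""
    observed = [(p, t) for p, t in zip(predictor, target) if t is not None]
    slope, intercept = fit_line([p for p, _ in observed], [t for _, t in observed])
    filled = []
    for p, t in zip(predictor, target):
        if t is None:
            value = slope * p + intercept
            # Beta values are proportions, so clip predictions to [0, 1].
            filled.append(min(1.0, max(0.0, value)))
        else:
            filled.append(t)
    return filled


# Hypothetical beta values for four samples at two correlated CpG sites.
predictor_site = [0.10, 0.20, 0.30, 0.40]
target_site = [0.15, 0.25, None, 0.45]
filled_site = impute_column(target_site, predictor_site)
```

In this example the observed pairs lie exactly on the line target = predictor + 0.05, so the missing entry is filled with approximately 0.35. A realistic setting would regress on many observed sites at once.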


2019 ◽  
Author(s):  
Lara Nonell ◽  
Juan R González

Abstract DNA methylation plays an important role in the development and progression of disease. Beta-values are the standard methylation measures. Different statistical methods have been proposed to assess differences in methylation between conditions. However, most of them do not completely account for the distribution of beta-values. The simplex distribution can accommodate beta-values data. We hypothesize that simplex is a quite flexible distribution which is able to model methylation data. To test our hypothesis, we conducted several analyses using four real data sets obtained from microarrays and sequencing technologies. Standard data distributions were studied and modelled in comparison to the simplex. Besides, some simulations were conducted in different scenarios encompassing several distribution assumptions, regression models and sample sizes. Finally, we compared DNA methylation between females and males in order to benchmark the assessed methodologies under different scenarios. According to the results obtained by the simulations and real data analyses, DNA methylation data are concordant with the simplex distribution in many situations. Simplex regression models work well in small sample size data sets. However, when sample size increases, other models such as the beta regression or even the linear regression can be employed to assess group comparisons and obtain unbiased results. Based on these results, we can provide some practical recommendations when analyzing methylation data: 1) use data sets of at least 10 samples per studied condition for microarray data sets or 30 in NGS data sets, 2) apply a simplex or beta regression model for microarray data, 3) apply a linear model in any other case.
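For readers unfamiliar with the simplex distribution mentioned above, the sketch below evaluates its standard density on the open interval (0, 1), f(y; μ, σ²) = [2πσ²{y(1 − y)}³]^(−1/2) · exp(−d(y; μ)/(2σ²)) with unit deviance d(y; μ) = (y − μ)² / (y(1 − y)μ²(1 − μ)²). The abstract does not state a parameterization, so this textbook form, the function names and the parameter values are assumptions.

```python
import math


def simplex_deviance(y, mu):
    """Unit deviance d(y; mu) of the simplex distribution."""
    return (y - mu) ** 2 / (y * (1.0 - y) * mu ** 2 * (1.0 - mu) ** 2)


def simplex_pdf(y, mu, sigma2):
    """Density of the simplex distribution with mean mu and dispersion sigma2."""
    if not 0.0 < y < 1.0:
        return 0.0
    norm = math.sqrt(2.0 * math.pi * sigma2 * (y * (1.0 - y)) ** 3)
    return math.exp(-simplex_deviance(y, mu) / (2.0 * sigma2)) / norm


def integrate_pdf(mu, sigma2, steps=20000):
    """Midpoint-rule integral of the density over (0, 1), as a sanity check."""
    h = 1.0 / steps
    return sum(simplex_pdf((k + 0.5) * h, mu, sigma2) for k in range(steps)) * h


# Numerical check that the density integrates to about 1 for two settings.
total_centered = integrate_pdf(mu=0.5, sigma2=1.0)
total_skewed = integrate_pdf(mu=0.3, sigma2=1.0)
```

Because the density is defined directly on (0, 1), beta values can be modelled without first transforming them, which is the property the abstract relies on.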


Author(s):  
J. Becker ◽  
P. Böhme ◽  
A. Reckert ◽  
S. B. Eickhoff ◽  
B. E. Koop ◽  
...  

Abstract As a contribution to the discussion about the possible effects of ethnicity/ancestry on age estimation based on DNA methylation (DNAm) patterns, we directly compared age-associated DNAm in German and Japanese donors in one laboratory under identical conditions. DNAm was analyzed by pyrosequencing for 22 CpG sites (CpGs) in the genes PDE4C, RPA2, ELOVL2, DDO, and EDARADD in buccal mucosa samples from German and Japanese donors (N = 368 and N = 89, respectively). Twenty of these CpGs revealed a very high correlation with age and were subsequently tested for differences between German and Japanese donors aged between 10 and 65 years (N = 287 and N = 83, respectively). ANCOVA was performed by testing the Japanese samples against age- and sex-matched German subsamples (N = 83 each; extracted 500 times from the German total sample). The median p values suggest strong evidence for significant differences (p < 0.05) for at least two CpGs (EDARADD, CpG 2, and PDE4C, CpG 2) and no differences for 11 CpGs (p > 0.3). Age prediction models based on DNAm data from all 20 CpGs from German training data did not reveal relevant differences between the Japanese test samples and German subsamples. Apparently, the high number of included “robust CpGs” prevented relevant effects of differences in DNAm at two CpGs. Nevertheless, the presented data demonstrate the need for further research regarding the impact of confounding factors on DNAm in the context of ethnicity/ancestry to ensure a high quality of age estimation. One approach may be the search for “robust” CpG markers, which requires the targeted investigation of different populations, at best by collaborative research with coordinated research strategies.


2016 ◽  
Vol 27 (9) ◽  
pp. 2627-2640
Author(s):  
Chenyang Wang ◽  
Qi Shen ◽  
Li Du ◽  
Jinfeng Xu ◽  
Hong Zhang

DNA methylation has been shown to play an important role in many complex diseases. The rapid development of high-throughput DNA methylation scan technologies provides great opportunities for genomewide DNA methylation-disease association studies. As methylation is a dynamic process involving time, it is quite plausible that age contributes to its variation to a large extent. Therefore, in analyzing genomewide DNA methylation data, it is important to identify age-related DNA methylation marks and delineate their functional relationship. This helps us to better understand the underlying biological mechanism and facilitate early diagnosis and prognosis analysis of complex diseases. We develop a functional beta model for analyzing DNA methylation data and detecting age-related DNA methylation marks on the whole genome by naturally taking sampling scheme into account and accommodating flexible age-methylation dynamics. We focus on DNA methylation data obtained through the widely used bisulfite conversion technique and propose to use a beta model to relate the DNA methylation level to the age. Adjusting for certain confounders, the functional age effect is left completely unspecified, offering great flexibility and allowing extra data dynamics. An efficient algorithm is developed for estimating unknown parameters, and the Wald test is used to detect age-related DNA methylation marks. Simulation studies and several real data applications were provided to demonstrate the performance of the proposed method.


2018 ◽  
Vol 2018 ◽  
pp. 1-8 ◽  
Author(s):  
Yuanyuan Zhang ◽  
Shudong Wang ◽  
Xinzeng Wang

Background. DNA methylation is essential for regulating gene expression, and changes in DNA methylation status are commonly found in disease. Therefore, identification of differential methylation patterns, especially differentially methylated regions (DMRs), between two groups is important for understanding the mechanism of complex diseases. A few tools identify DMRs by considering features of methylation data, but current methods do not comprehensively integrate the characteristics of DNA methylation data. Results. Accounting for the characteristics of methylation data, such as the correlation between neighboring CpG sites and the high heterogeneity of DNA methylation data, we propose a data-driven approach for DMR identification that evaluates the energy of a single site using a modified 1D Ising model. Applied to both simulated and publicly available datasets, our approach is compared with other popular methods in terms of performance. Simulated results show that our method is more sensitive than competing methods. Applied to the real data, our method can identify more common DMRs than DMRcate, ProbeLasso, and Wang’s methods with a high overlapping ratio. Also, the necessity of integrating the heterogeneity and correlation characteristics in identifying DMRs is shown by comparing against results that consider only mean or only variance signals and against results that ignore the relationship between neighboring CpG sites. By analyzing the number of DMRs identified in real data located in different genomic regions, we find that about 90% of DMRs are located in CGIs, which regulate the expression of genes. This may help us understand the functional effect of DNA methylation on disease.


Genes ◽  
2018 ◽  
Vol 9 (9) ◽  
pp. 424 ◽  
Author(s):  
Xingyan Li ◽  
Weidong Li ◽  
Yan Xu

All tissues of organisms age over time. In recent years, epigenetic investigations have found a close correlation between DNA methylation and aging. With the development of DNA methylation research, a quantitative statistical relationship between DNA methylation and age has been established based on how methylation changes with age, which makes it possible to predict the age of individuals. All the data in this work were retrieved from the Illumina HumanMethylation BeadChip platform (27K or 450K). We analyzed 16 sets of healthy samples and 9 sets of diseased samples. The healthy samples included a total of 1899 publicly available blood samples (0–103 years old) and the diseased samples included 2395 blood samples. Six age-related CpG sites were selected by calculating Pearson correlation coefficients between age and DNA methylation values. We built a gradient boosting regressor model for these age-related CpG sites. 70% of the data was randomly selected as training data and the other 30% as independent data in each dataset for 25 runs in total. In the training dataset, the healthy samples showed that the correlation between predicted age and DNA methylation was 0.97, and the mean absolute deviation (MAD) was 2.72 years. In the independent dataset, the MAD was 4.06 years. The proposed model was further tested using the diseased samples. The MAD was 5.44 years for the training dataset and 7.08 years for the independent dataset. Furthermore, our model worked well when it was applied to saliva samples. These results illustrate that age prediction based on six DNA methylation markers is very effective using the gradient boosting regressor.
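Two steps named in this abstract, ranking CpG sites by their Pearson correlation with age and scoring predictions by the mean absolute deviation (MAD), can be sketched as follows. The site identifiers, beta values and predicted ages are made up for illustration, the helper names are hypothetical, and the gradient boosting regressor itself is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def select_sites(methylation, ages, top_k):
    """Return the top_k site names ranked by absolute correlation with age."""
    ranked = sorted(
        methylation,
        key=lambda site: abs(pearson(methylation[site], ages)),
        reverse=True,
    )
    return ranked[:top_k]


def mean_absolute_deviation(predicted, actual):
    """Average absolute difference between predicted and chronological age."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical beta values for three made-up sites across four donors.
ages = [20, 40, 60, 80]
methylation = {
    "site_a": [0.2, 0.4, 0.6, 0.8],  # rises with age
    "site_b": [0.5, 0.4, 0.6, 0.5],  # weakly related to age
    "site_c": [0.9, 0.7, 0.5, 0.3],  # falls with age
}
selected = select_sites(methylation, ages, top_k=2)

# Hypothetical predictions from some age model, scored by MAD in years.
mad_years = mean_absolute_deviation([22, 38, 65], [20, 40, 60])
```

Ranking by the absolute value of the correlation keeps sites whose methylation decreases with age as well as those whose methylation increases.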


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Zane K. J. Hartley ◽  
Aaron S. Jackson ◽  
Michael Pound ◽  
Andrew P. French

3D reconstruction of fruit is important as a key component of fruit grading and an important part of many size estimation pipelines. Like many computer vision challenges, the 3D reconstruction task suffers from a lack of readily available training data in most domains, with methods typically depending on large datasets of high-quality image-model pairs. In this paper, we propose an unsupervised domain-adaptation approach to 3D reconstruction where labelled images only exist in our source synthetic domain, and training is supplemented with different unlabelled datasets from the target real domain. We approach the problem of 3D reconstruction using volumetric regression and produce a training set of 25,000 pairs of images and volumes using hand-crafted 3D models of bananas rendered in a 3D modelling environment (Blender). Each image is then enhanced by a GAN to more closely match the domain of photographs of real images by introducing a volumetric consistency loss, improving performance of 3D reconstruction on real images. Our solution harnesses the cost benefits of synthetic data while still maintaining good performance on real world images. We focus this work on the task of 3D banana reconstruction from a single image, representing a common task in plant phenotyping, but this approach is general and may be adapted to any 3D reconstruction task including other plant species and organs.

