A deep learning framework for imputing missing values in genomic data

2018 ◽  
Author(s):  
Yeping Lina Qiu ◽  
Hong Zheng ◽  
Olivier Gevaert

Abstract Motivation The presence of missing values is a frequent problem encountered in genomic data analysis. Missing data can be an obstacle to downstream analyses that require complete data matrices. State-of-the-art imputation techniques, including Singular Value Decomposition (SVD)- and K-Nearest Neighbors (KNN)-based methods, usually achieve good performance but are computationally expensive, especially for large datasets such as those involved in pan-cancer analysis. Results This study describes a new method, a denoising autoencoder with partial loss (DAPL), as a deep learning based alternative for data imputation. Results on pan-cancer gene expression data and DNA methylation data from over 11,000 samples demonstrate significant improvement over a standard denoising autoencoder, both for data missing-at-random cases over a range of missing percentages and for missing-not-at-random cases based on expression level and GC-content. We discuss the advantages of DAPL over traditional imputation methods and show that it achieves comparable or better performance with less computational burden. Availability https://github.com/gevaertlab/ Contact [email protected]
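As a rough illustration of the partial-loss idea named in this abstract, the sketch below trains a denoising autoencoder whose reconstruction error is computed only over observed entries, so zero-filled missing positions do not drive the gradient. Layer sizes, the corruption rate, and all identifiers are illustrative assumptions, not the authors' released DAPL code.

```python
# Minimal PyTorch sketch of a denoising autoencoder with a "partial" loss.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_features, hidden=512, latent=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, latent), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def partial_mse(x_hat, x, observed_mask):
    # Average squared error over observed entries only.
    diff = (x_hat - x) * observed_mask
    return diff.pow(2).sum() / observed_mask.sum()

def train_step(model, optimizer, x, observed_mask, corruption=0.1):
    # Corrupt a further random fraction of observed entries to zero,
    # as in a standard denoising autoencoder, then reconstruct.
    noise_mask = (torch.rand_like(x) > corruption).float()
    x_in = x * observed_mask * noise_mask
    x_hat = model(x_in)
    loss = partial_mse(x_hat, x, observed_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```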

GigaScience ◽  
2020 ◽  
Vol 9 (8) ◽  
Author(s):  
Yeping Lina Qiu ◽  
Hong Zheng ◽  
Olivier Gevaert

Abstract Background As missing values are frequently present in genomic data, practical methods to handle missing data are necessary for downstream analyses that require complete data sets. State-of-the-art imputation techniques, including methods based on singular value decomposition and K-nearest neighbors, can be computationally expensive for large data sets, and it is difficult to modify these algorithms to handle cases that are not missing at random. Results In this work, we use a deep-learning framework based on the variational auto-encoder (VAE) for genomic missing value imputation and demonstrate its effectiveness in transcriptome and methylome data analysis. We show that in the vast majority of our testing scenarios, VAE achieves similar or better performance than the most widely used imputation standards, while having a computational advantage at evaluation time. When dealing with data missing not at random (e.g., few values are missing), we develop simple yet effective methodologies to leverage prior knowledge about the missing data. Furthermore, we investigate the effect of varying latent space regularization strength in VAE on imputation performance and, in this context, show why VAE has a better imputation capacity than a regular deterministic auto-encoder. Conclusions We describe a deep learning imputation framework for transcriptome and methylome data using a VAE and show that it can be a preferable alternative to traditional methods for data imputation, especially in the setting of large-scale data and certain missing-not-at-random scenarios.
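The sketch below illustrates, under assumptions, two pieces of the VAE approach described above: a training loss whose KL term is weighted by a tunable beta (the latent-space regularization strength the abstract refers to), and imputation at evaluation time by keeping observed values and filling missing positions from the decoder output. The `encoder`/`decoder` modules and their signatures are hypothetical placeholders, not the paper's implementation.

```python
import torch

def vae_loss(x_hat, x, mu, logvar, observed_mask, beta=1.0):
    # Reconstruction error over observed entries plus a beta-weighted KL term;
    # beta controls how strongly the latent space is regularized.
    recon = ((x_hat - x) * observed_mask).pow(2).sum() / observed_mask.sum()
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

@torch.no_grad()
def vae_impute(encoder, decoder, x, observed_mask):
    # Encode the zero-filled sample, decode from the latent mean (no sampling
    # at test time), and keep observed values while filling missing positions.
    mu, logvar = encoder(x * observed_mask)
    x_hat = decoder(mu)
    return observed_mask * x + (1 - observed_mask) * x_hat
```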


Preprocessing is the preparation of raw data before the actual statistical method is applied. Data preprocessing is one of the most vital steps in the data mining process and deals with the preparation and transformation of the initial dataset. It matters because analyzing data that has not been properly preprocessed can lead to inaccurate and meaningless results. Almost every study contains missing data, and the way missing values are handled must support an efficient and valid analysis. Missing-value imputation is one of the steps in data cleaning. Here, four imputation methods are compared: mean imputation, Singular Value Decomposition (SVD), K-Nearest Neighbors (KNN), and Bayesian Principal Component Analysis (BPCA). The comparison was performed on the real VASA dataset using performance criteria such as Mean Square Error (MSE) and Root Mean Square Error (RMSE). BPCA emerged as the best imputation method and deserves further consideration in practice.
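For readers who want to reproduce this kind of MSE/RMSE comparison on their own data, here is a minimal scikit-learn sketch covering two of the four methods (mean and KNN imputation). The random matrix stands in for the VASA dataset, and the SVD and BPCA imputers are omitted; everything here is illustrative rather than the study's actual pipeline.

```python
# Compare mean and KNN imputation under MSE/RMSE on a synthetic matrix.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 20))

# Mask 10% of the entries completely at random.
mask = rng.random(X_true.shape) < 0.10
X_miss = X_true.copy()
X_miss[mask] = np.nan

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("knn", KNNImputer(n_neighbors=5))]:
    X_hat = imputer.fit_transform(X_miss)
    mse = np.mean((X_hat[mask] - X_true[mask]) ** 2)
    print(f"{name}: MSE={mse:.4f}, RMSE={np.sqrt(mse):.4f}")
```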


2021 ◽  
Author(s):  
Troy M LaPolice ◽  
Yi-Fei Huang

Being able to predict essential genes intolerant to loss-of-function (LOF) mutations can dramatically improve our ability to identify genes associated with genetic disorders. Numerous computational methods have recently been developed to predict human essential genes from population genomic data; however, the existing methods have limited power in pinpointing short essential genes due to the sparsity of polymorphisms in the human genome. Here we present an evolution-based deep learning model, DeepLOF, which integrates population and functional genomic data to improve gene essentiality prediction. Compared to previous methods, DeepLOF shows unmatched performance in predicting ClinGen haploinsufficient genes, mouse essential genes, and essential genes in human cell lines. Furthermore, DeepLOF discovers 109 potentially essential genes that are too short to be identified by previous methods. Altogether, DeepLOF is a powerful computational method to aid in the discovery of essential genes.
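As a purely hypothetical sketch of the general idea of combining population-genetic features (e.g., observed versus expected loss-of-function counts) with functional genomic features in a single network that scores LOF intolerance, consider the toy model below. It does not reproduce the DeepLOF architecture or its evolution-based model; all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class EssentialityScorer(nn.Module):
    def __init__(self, n_pop_features, n_func_features, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_pop_features + n_func_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pop_x, func_x):
        # Concatenate the two feature groups and emit a probability that the
        # gene is intolerant to loss-of-function mutations.
        return torch.sigmoid(self.net(torch.cat([pop_x, func_x], dim=-1)))
```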


2019 ◽  
Vol 2 (1) ◽  
pp. 22-34
Author(s):  
Sukanya Patra ◽  
Boudhayan Ganguly

Online recommender systems are an integral part of e-commerce, and there is a plethora of algorithms following different approaches. However, most approaches, with the exception of singular value decomposition (SVD), do not provide any insight into the underlying patterns/concepts used in item rating. SVD uses underlying features of movies but is computationally resource-heavy and performs poorly when the data are sparse. In this article, we perform a comparative study of several pre-processing algorithms for SVD. In the experiments, we used the MovieLens 1M dataset to compare the performance of these algorithms. A K-nearest neighbour (KNN) based approach was used to find the K nearest neighbours of each user, whose ratings were then used to impute the missing values. Experiments were conducted using different distance measures, such as Jaccard and Euclidean. We found that the KNN-based pre-processing for SVD performed best when the missing values were imputed using the mean of similar users and the distance measure was Euclidean. Based on our comparative study, data managers can choose the algorithm best suited to their business.
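A minimal sketch of that pre-processing idea follows, assuming Euclidean distance over zero-filled rating vectors, mean imputation from each user's K nearest neighbours, and a low-rank SVD reconstruction afterwards. The tiny rating matrix, K, and the rank are illustrative and not the authors' exact setup.

```python
import numpy as np

def knn_mean_impute(R, k=5):
    # R: users x items rating matrix with np.nan for missing ratings.
    filled = np.nan_to_num(R, nan=0.0)
    imputed = R.copy()
    for u in range(R.shape[0]):
        dist = np.linalg.norm(filled - filled[u], axis=1)  # Euclidean
        dist[u] = np.inf
        neighbours = np.argsort(dist)[:k]
        for i in np.where(np.isnan(R[u]))[0]:
            ratings = R[neighbours, i]
            ratings = ratings[~np.isnan(ratings)]
            # Fall back to the item mean if no neighbour rated the item.
            imputed[u, i] = ratings.mean() if ratings.size else np.nanmean(R[:, i])
    return imputed

R = np.array([[5, 4, np.nan, 1],
              [4, np.nan, 2, 1],
              [1, 1, 5, np.nan],
              [np.nan, 1, 4, 5]], dtype=float)

R_filled = knn_mean_impute(R, k=2)
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
R_hat = (U[:, :2] * s[:2]) @ Vt[:2]   # rank-2 reconstruction of the ratings
```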


2020 ◽  
Author(s):  
Raniyaharini R ◽  
Madhumitha K ◽  
Mishaa S ◽  
Virajaravi R

2020 ◽  
Author(s):  
Jinseok Lee

BACKGROUND The coronavirus disease (COVID-19) has spread explosively worldwide since the beginning of 2020. According to a multinational consensus statement from the Fleischner Society, computed tomography (CT) can be used as a relevant screening tool owing to its higher sensitivity for detecting early pneumonic changes. However, physicians are extremely busy fighting COVID-19 in this era of worldwide crisis. Thus, it is crucial to accelerate the development of an artificial intelligence (AI) diagnostic tool to support physicians. OBJECTIVE We aimed to rapidly develop an AI technique to diagnose COVID-19 pneumonia on CT and differentiate it from non-COVID pneumonia and non-pneumonia diseases. METHODS A simple 2D deep learning framework, named the fast-track COVID-19 classification network (FCONet), was developed to diagnose COVID-19 pneumonia from a single chest CT image. FCONet was developed by transfer learning, using one of four state-of-the-art pre-trained deep learning models (VGG16, ResNet50, InceptionV3, or Xception) as a backbone. For training and testing of FCONet, we collected 3,993 chest CT images of patients with COVID-19 pneumonia, other pneumonia, and non-pneumonia diseases from Wonkwang University Hospital, Chonnam National University Hospital, and the Italian Society of Medical and Interventional Radiology public database. These CT images were split into training and testing sets at a ratio of 8:2. On the test dataset, the diagnostic performance for COVID-19 pneumonia was compared among the four pre-trained FCONet models. In addition, we tested the FCONet models on an additional external testing dataset extracted from embedded low-quality chest CT images of COVID-19 pneumonia in recently published papers. RESULTS Of the four pre-trained FCONet models, the ResNet50-based model showed excellent diagnostic performance (sensitivity 99.58%, specificity 100%, and accuracy 99.87%) and outperformed the other three pre-trained models on the testing dataset. On the additional external test dataset of low-quality CT images, the detection accuracy of the ResNet50 model was the highest (96.97%), followed by Xception, InceptionV3, and VGG16 (90.71%, 89.38%, and 87.12%, respectively). CONCLUSIONS FCONet, a simple 2D deep learning framework based on a single chest CT image, provides excellent diagnostic performance in detecting COVID-19 pneumonia. Based on our testing dataset, the ResNet50-based FCONet may be the best model, as it outperformed the other FCONet models based on VGG16, Xception, and InceptionV3.
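The following sketch shows transfer learning in the spirit of FCONet, assuming a torchvision ResNet50 backbone with a new three-class head (COVID-19 pneumonia, other pneumonia, non-pneumonia). The preprocessing, channel handling of CT slices, and fine-tuning schedule are assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_fconet_like(num_classes=3, freeze_backbone=True):
    # Start from an ImageNet-pretrained ResNet50 backbone.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False
    # Replace the final fully connected layer with a 3-class head.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_fconet_like()
logits = model(torch.randn(2, 3, 224, 224))   # dummy batch standing in for CT slices
```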

