Improving variant calling using population data and deep learning

2021 ◽  
Author(s):  
Nae-Chyun Chen ◽  
Alexey Kolesnikov ◽  
Sidharth Goel ◽  
Taedong Yun ◽  
Pi-Chuan Chang ◽  
...  

Large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample. These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering which trades recall for precision. In this study, we modify DeepVariant to add a new channel encoding population allele frequencies from the 1000 Genomes Project. We show that this model reduces variant calling errors, improving both precision and recall. We assess the impact of using population-specific or diverse reference panels. We achieve the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.
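The abstract describes adding a channel that encodes population allele frequencies to DeepVariant's pileup representation. A minimal sketch of that idea follows; the 0–254 pixel scaling and the constant-per-column channel are assumptions for illustration, not DeepVariant's actual encoding:

```python
# Sketch: appending a population allele-frequency (AF) channel to a
# toy pileup tensor, in the spirit of the modification described
# above. Real pileup images hold per-read, per-column values; here
# each "read" is just a short list of existing channel values.

def af_channel_value(allele_frequency, max_pixel=254):
    """Map an allele frequency in [0, 1] to a pixel intensity."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be in [0, 1]")
    return round(allele_frequency * max_pixel)

def add_af_channel(pileup_rows, allele_frequency):
    """Append a constant AF channel value to each read row."""
    af = af_channel_value(allele_frequency)
    return [row + [af] for row in pileup_rows]

# Toy pileup: two reads, three existing channel values each,
# augmented with a 1000 Genomes-style AF of 0.25.
pileup = [[60, 120, 180], [60, 120, 180]]
augmented = add_af_channel(pileup, 0.25)
```

The network can then learn, per site, how much weight to give the population prior relative to the read evidence.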


2018 ◽  
Vol 27 (Supplement_R1) ◽  
pp. R63-R71 ◽  
Author(s):  
Amalio Telenti ◽  
Christoph Lippert ◽  
Pi-Chuan Chang ◽  
Mark DePristo

Abstract The human genome is now investigated through high-throughput functional assays, and through the generation of population genomic data. These advances support the identification of functional genetic variants and the prediction of traits (e.g. deleterious variants and disease). This review summarizes lessons learned from the large-scale analyses of genome and exome data sets, modeling of population data and machine-learning strategies to solve complex genomic sequence regions. The review also portrays the rapid adoption of artificial intelligence/deep neural networks in genomics; in particular, deep learning approaches are well suited to model the complex dependencies in the regulatory landscape of the genome, and to provide predictors for genetic variant calling and interpretation.



2021 ◽  
Vol 13 (3) ◽  
pp. 364
Author(s):  
Han Gao ◽  
Jinhui Guo ◽  
Peng Guo ◽  
Xiuwan Chen

Recently, deep learning has become the most innovative trend for a variety of high-spatial-resolution remote sensing imaging applications. However, large-scale land cover classification via traditional convolutional neural networks (CNNs) with sliding windows is computationally expensive and produces coarse results. Additionally, although such supervised learning approaches have performed well, collecting and annotating datasets for every task is extremely laborious, especially in fully supervised cases where the pixel-level ground-truth labels are dense. In this work, we propose a new object-oriented deep learning framework that leverages residual networks with different depths to learn adjacent feature representations by embedding a multibranch architecture in the deep learning pipeline. The idea is to exploit limited training data at different neighboring scales to make a tradeoff between weak semantics and strong feature representations for operational land cover mapping tasks. We draw on established geographic object-based image analysis (GEOBIA) as an auxiliary module to reduce the computational burden of spatial reasoning and optimize the classification boundaries. We evaluated the proposed approach on two subdecimeter-resolution datasets covering both urban and rural landscapes. It achieved better classification accuracy (88.9%) than traditional object-based deep learning methods, with an excellent inference time (11.3 s/ha).
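The abstract does not spell out how the GEOBIA module refines classification boundaries; one common object-based post-processing step is a per-segment majority vote over pixel predictions. The sketch below illustrates that generic technique with toy labels, and is not the paper's actual pipeline:

```python
# Sketch: cleaning per-pixel CNN predictions with image segments, a
# standard GEOBIA-style refinement. Every pixel is relabeled with the
# most common predicted class inside its segment, which snaps noisy
# predictions to object boundaries. Segment ids and classes are toy.
from collections import Counter

def majority_vote_per_segment(segment_ids, pixel_classes):
    """Relabel each pixel with the majority class of its segment."""
    votes = {}
    for seg, cls in zip(segment_ids, pixel_classes):
        votes.setdefault(seg, Counter())[cls] += 1
    winner = {seg: c.most_common(1)[0][0] for seg, c in votes.items()}
    return [winner[seg] for seg in segment_ids]

# Five pixels in two segments; one noisy "grass" pixel inside a
# segment that is mostly "road" gets corrected.
segments = [0, 0, 0, 1, 1]
classes = ["road", "road", "grass", "grass", "grass"]
smoothed = majority_vote_per_segment(segments, classes)
```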



2021 ◽  
Vol 7 (3) ◽  
pp. 59
Author(s):  
Yohanna Rodriguez-Ortega ◽  
Dora M. Ballesteros ◽  
Diego Renza

With the exponential growth of high-quality fake images on social networks and media, it is necessary to develop recognition algorithms for this type of content. One of the most common types of image and video editing consists of duplicating areas of the image, known as the copy-move technique. Traditional image processing approaches manually look for patterns related to the duplicated content, limiting their use in mass data classification. In contrast, approaches based on deep learning have shown better performance and promising results, but they present generalization problems, with a high dependence on training data and the need for appropriate selection of hyperparameters. To overcome this, we propose two deep learning approaches: a model with a custom architecture and a model based on transfer learning. In each case, the impact of network depth is analyzed in terms of precision (P), recall (R) and F1 score. Additionally, the problem of generalization is addressed with images from eight different open access datasets. Finally, the models are compared in terms of evaluation metrics and of training and inference times. The transfer learning model based on VGG-16 achieves metrics about 10% higher than the custom-architecture model; however, it requires approximately twice as much inference time.
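The precision, recall and F1 metrics used to compare the two models follow directly from true-positive, false-positive and false-negative counts. A minimal reference implementation, with made-up counts for a copy-move detector:

```python
# Precision (P), recall (R) and F1 from confusion-matrix counts, as
# used to evaluate the two forgery-detection models above.

def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1); each defaults to 0 when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy counts: 80 forgeries detected, 20 false alarms, 20 missed.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```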



Sensors ◽  
2018 ◽  
Vol 18 (10) ◽  
pp. 3232 ◽  
Author(s):  
Yan Liu ◽  
Qirui Ren ◽  
Jiahui Geng ◽  
Meng Ding ◽  
Jiangyun Li

Efficient and accurate semantic segmentation is the key technique for automatic remote sensing image analysis. While there have been many segmentation methods based on traditional hand-crafted feature extractors, it is still challenging to process high-resolution and large-scale remote sensing images. In this work, a novel patch-wise semantic segmentation method with a new training strategy based on fully convolutional networks is presented to segment common land resources. First, to handle the high-resolution imagery, the images are split into local patches and a patch-wise network is built. Second, training data is preprocessed in several ways to address the specific characteristics of remote sensing images, i.e., color imbalance, object rotation variations and lens distortion. Third, a multi-scale training strategy is developed to solve the severe scale variation problem. In addition, the impact of a conditional random field (CRF) on precision is studied. The proposed method was evaluated on a dataset collected from a capital city in West China with the Gaofen-2 satellite. The dataset contains ten common land resources (grassland, road, etc.). The experimental results show that the proposed algorithm achieves 54.96% mean intersection over union (MIoU) and outperforms other state-of-the-art methods in remote sensing image segmentation.
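The first step above, splitting a high-resolution image into local patches, can be sketched as follows; the non-overlapping tiling and plain nested-list image are simplifying assumptions (real pipelines typically use overlapping patches and array libraries):

```python
# Sketch of the patch-wise step: tiling a large 2-D image into fixed-
# size local patches before feeding each to the segmentation network.

def split_into_patches(image, patch):
    """Split a 2-D image (list of rows) into non-overlapping
    patch x patch tiles. Image dimensions must divide evenly."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0
    return [
        [row[x:x + patch] for row in image[y:y + patch]]
        for y in range(0, h, patch)
        for x in range(0, w, patch)
    ]

# 4x4 toy image with pixel values 0..15, split into four 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
```

The predictions for each patch are then stitched back into a full-resolution segmentation map in the reverse order.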



PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254363
Author(s):  
Aji John ◽  
Kathleen Muenzen ◽  
Kristiina Ausmees

Advances in whole-genome sequencing have greatly reduced the cost and time of obtaining raw genetic information, but the computational requirements of analysis remain a challenge. Serverless computing has emerged as an alternative to using dedicated compute resources, but its utility has not been widely evaluated for standardized genomic workflows. In this study, we define and execute a best-practice joint variant calling workflow using the SWEEP workflow management system. We present an analysis of performance and scalability, and discuss the utility of the serverless paradigm for executing workflows in the field of genomics research. The GATK best-practice short germline joint variant calling pipeline was implemented as a SWEEP workflow comprising 18 tasks. The workflow was executed on Illumina paired-end read samples from the European and African super populations of the 1000 Genomes project phase III. Cost and runtime increased linearly with increasing sample size, although runtime was driven primarily by a single task for larger problem sizes. Execution took a minimum of around 3 hours for 2 samples, up to nearly 13 hours for 62 samples, with costs ranging from $2 to $70.
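Given the reported linear scaling (about 3 h and $2 at 2 samples, up to nearly 13 h and $70 at 62 samples), intermediate cohort sizes can be roughly interpolated. This is only an estimate under the stated linear trend; the paper's exact per-sample figures are not given:

```python
# Rough linear interpolation of runtime and cost between the two
# endpoints reported in the abstract. Endpoint values are the
# abstract's approximate figures, not exact measurements.

def linear_estimate(x, x0, y0, x1, y1):
    """Interpolate y at x on the line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Estimates for a hypothetical 32-sample cohort.
runtime_hours = linear_estimate(32, 2, 3.0, 62, 13.0)
cost_dollars = linear_estimate(32, 2, 2.0, 62, 70.0)
```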



2019 ◽  
Author(s):  
Mojtaba Haghighatlari ◽  
Gaurav Vishwakarma ◽  
Mohammad Atif Faiz Afzal ◽  
Johannes Hachmann

We present a multitask, physics-infused deep learning model to accurately and efficiently predict refractive indices (RIs) of organic molecules, and we apply it to a library of 1.5 million compounds. We show that it outperforms earlier machine learning models by a significant margin, and that incorporating known physics into data-derived models provides valuable guardrails. Using a transfer learning approach, we augment the model to reproduce results consistent with higher-level computational chemistry training data, but with a considerably reduced number of corresponding calculations. Prediction errors of machine learning models are typically smallest for commonly observed target property values, consistent with the distribution of the training data. However, since our goal is to identify candidates with unusually large RI values, we propose a strategy to boost the performance of our model in the remoter areas of the RI distribution: we bias the model with respect to the under-represented classes of molecules that have values in the high-RI regime. By adopting a metric popular in web search engines, we evaluate our effectiveness in ranking top candidates. We confirm that the models developed in this study can reliably predict the RIs of the top 1,000 compounds, and are thus able to capture their ranking. We believe that this is the first study to develop a data-derived model that ensures the reliability of RI predictions by model augmentation in the extrapolation region on such a large scale. These results underscore the tremendous potential of machine learning in facilitating molecular (hyper)screening approaches on a massive scale and in accelerating the discovery of new compounds and materials, such as organic molecules with high RI for applications in opto-electronics.



Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Tao Chen ◽  
Mingfen Wu ◽  
Hexi Li

Abstract The automatic extraction of meaningful relations from biomedical literature or clinical records is crucial in various biomedical applications. Most of the current deep learning approaches for medical relation extraction require large-scale training data to prevent overfitting of the training model. We propose using a pre-trained model and a fine-tuning technique to improve these approaches without additional time-consuming human labeling. Firstly, we show the architecture of Bidirectional Encoder Representations from Transformers (BERT), an approach for pre-training a model on large-scale unstructured text. We then combine BERT with a one-dimensional convolutional neural network (1d-CNN) to fine-tune the pre-trained model for relation extraction. Extensive experiments on three datasets, namely the BioCreative V chemical disease relation corpus, traditional Chinese medicine literature corpus and i2b2 2012 temporal relation challenge corpus, show that the proposed approach achieves state-of-the-art results (giving a relative improvement of 22.2, 7.77, and 38.5% in F1 score, respectively, compared with a traditional 1d-CNN classifier). The source code is available at https://github.com/chentao1999/MedicalRelationExtraction.
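The reported gains (22.2%, 7.77% and 38.5%) are relative improvements in F1 over the plain 1d-CNN baseline; relative improvement is (new − old) / old. The baseline and improved F1 values below are illustrative, not the paper's numbers:

```python
# Relative improvement in a metric over a baseline, as used to report
# the F1 gains of BERT + 1d-CNN over the plain 1d-CNN classifier.

def relative_improvement_pct(baseline, improved):
    """Percentage relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0

# e.g. a baseline F1 of 0.60 improved to 0.72 is a 20% relative gain.
gain = relative_improvement_pct(0.60, 0.72)
```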



BMC Neurology ◽  
2019 ◽  
Vol 19 (1) ◽  
Author(s):  
Knut Hagen ◽  
Lars Jacob Stovner ◽  
Kristian Bernhard Nilsen ◽  
Espen Saxhaug Kristoffersen ◽  
Bendik Slagsvold Winsvold

Abstract Background Increased high-sensitivity C-reactive protein (hs-CRP) levels have been found in many earlier studies on migraine, and recently also in persons with migraine and insomnia. The aim of this study was to see whether these findings could be reproduced in a large-scale population-based study. Methods A total of 50,807 (54%) of 94,194 invited individuals aged ≥20 years participated in the third wave of the Nord-Trøndelag Health Study, performed in 2006–2008. Among these, 38,807 (41%) had valid measures of hs-CRP and answered questions on headache and insomnia. Elevated hs-CRP was defined as >3.0 mg/L. The cross-sectional association with headache was estimated by multivariate analyses using multiple logistic regression. The precision of the odds ratio (OR) was assessed with a 95% confidence interval (CI). Results In the fully adjusted model, elevated hs-CRP was associated with migraine (OR 1.14, 95% CI 1.04–1.25) and migraine with aura (OR 1.15, 95% CI 1.03–1.29). The association was strongest among individuals with headache ≥15 days/month for any headache (OR 1.26, 95% CI 1.08–1.48), migraine (OR 1.62, 95% CI 1.21–2.17), and migraine with aura (OR 1.84, 95% CI 1.27–2.67). No clear relationship was found between elevated hs-CRP and headache less than 7 days/month, or with insomnia. Conclusions Cross-sectional data from this large-scale population-based study showed that elevated hs-CRP was associated with headache ≥7 days/month, especially evident for migraine with aura.
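The study's ORs come from adjusted logistic regression, but the basic OR-with-95%-CI machinery can be illustrated for a simple unadjusted 2×2 table using the Wald interval on the log odds ratio. All counts below are made up for illustration and are not the study's data:

```python
# Odds ratio and Wald 95% CI from a 2x2 table:
#   a = exposed cases,   b = exposed controls
#   c = unexposed cases, d = unexposed controls
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Return (OR, CI lower bound, CI upper bound)."""
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts: headache cases/controls by elevated hs-CRP.
or_, lo, hi = odds_ratio_ci(120, 880, 100, 900)
```

An association is conventionally read as statistically significant when the 95% CI excludes 1, as for the ORs reported above.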



2020 ◽  
Vol 12 (18) ◽  
pp. 3053 ◽  
Author(s):  
Thorsten Hoeser ◽  
Felix Bachofer ◽  
Claudia Kuenzer

In Earth observation (EO), large-scale land-surface dynamics are traditionally analyzed by investigating aggregated classes. The increase in data with a very high spatial resolution enables investigations on a fine-grained feature level which can help us to better understand the dynamics of land surfaces by taking object dynamics into account. To extract fine-grained features and objects, the most popular deep-learning model for image analysis is commonly used: the convolutional neural network (CNN). In this review, we provide a comprehensive overview of the impact of deep learning on EO applications by reviewing 429 studies on image segmentation and object detection with CNNs. We extensively examine the spatial distribution of study sites, employed sensors, used datasets and CNN architectures, and give a thorough overview of applications in EO which used CNNs. Our main finding is that CNNs are in an advanced transition phase from computer vision to EO. Upon this, we argue that in the near future, investigations which analyze object dynamics with CNNs will have a significant impact on EO research. With a focus on EO applications in this Part II, we complete the methodological review provided in Part I.



2018 ◽  
Author(s):  
J Budis ◽  
J Gazdarica ◽  
J Radvanszky ◽  
M Harsanyova ◽  
I Gazdaricova ◽  
...  

Abstract Low-coverage massively parallel genome sequencing for non-invasive prenatal testing (NIPT) of common aneuploidies is one of the most rapidly adopted and relatively low-cost DNA tests. Since aggregating reads from a large number of samples overcomes the problem of extremely low coverage in individual samples, we describe the possible re-use of data generated during NIPT testing for genome-scale, population-specific frequency determination of small DNA variants, requiring no additional costs beyond those of the NIPT test itself. We applied our method to a data set comprising 1,548 original NIPT test results and evaluated the findings on different levels, from in silico population frequency comparisons up to wet-lab validation analyses using a gold-standard method. The revealed high reliability of variant calling and allele frequency determination suggests that these NIPT data could serve as a valuable alternative to large-scale population studies, even for smaller countries around the world.
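The core aggregation idea above can be sketched as pooling allele-supporting read counts across many low-coverage samples at a site: no single ~1× sample supports a confident call, but the pooled counts yield a usable population allele-frequency estimate. The counts below are toy values:

```python
# Sketch: pooling reads across low-coverage NIPT samples to estimate
# a population allele frequency at one site.

def pooled_allele_frequency(samples):
    """samples: list of (alt_read_count, total_read_count) pairs,
    one per sample at the same genomic site."""
    alt = sum(a for a, _ in samples)
    total = sum(t for _, t in samples)
    return alt / total if total else 0.0

# Three samples at roughly 1x coverage over the site: individually
# uninformative, jointly yielding an AF estimate of 2/6.
freq = pooled_allele_frequency([(1, 2), (0, 1), (1, 3)])
```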


