Detecting adaptive introgression in human evolution using convolutional neural networks

Studies in a variety of species have shown evidence for positively selected variants introduced into a population via introgression from another, distantly related population - a process known as adaptive introgression. However, there are few explicit frameworks for jointly modelling introgression and positive selection, in order to detect these variants using genomic sequence data. Here, we develop an approach based on convolutional neural networks (CNNs). CNNs do not require the specification of an analytical model of allele frequency dynamics, and have outperformed alternative methods for classification and parameter estimation tasks in various areas of population genetics. Thus, they are potentially well suited to the identification of adaptive introgression. Using simulations, we trained CNNs on genotype matrices derived from genomes sampled from the donor population, the recipient population and a related non-introgressed population, in order to distinguish regions of the genome evolving under adaptive introgression from those evolving neutrally or experiencing selective sweeps. Our CNN architecture exhibits 95% accuracy on simulated data, even when the genomes are unphased, and accuracy decreases only moderately in the presence of heterosis. As a proof of concept, we applied our trained CNNs to human genomic datasets - both phased and unphased - to detect candidates for adaptive introgression that shaped our evolutionary history.
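As a rough illustration of the input described above, the sketch below stacks per-population matrices into a single CNN input and shows one way phased haplotypes could be collapsed into unphased genotype counts. The row ordering (donor, then recipient, then non-introgressed population), the 0/1 allele coding, and the helper names `unphase` and `stack_populations` are assumptions made for this example, not the authors' exact encoding.

```python
from typing import List

Matrix = List[List[int]]  # rows = haplotypes or individuals, columns = sites


def unphase(haplotypes: Matrix) -> Matrix:
    """Collapse consecutive haplotype pairs into diploid genotype counts (0, 1 or 2)."""
    if len(haplotypes) % 2 != 0:
        raise ValueError("expected an even number of haplotypes")
    return [
        [a + b for a, b in zip(haplotypes[i], haplotypes[i + 1])]
        for i in range(0, len(haplotypes), 2)
    ]


def stack_populations(donor: Matrix, recipient: Matrix, outgroup: Matrix) -> Matrix:
    """Concatenate per-population matrices row-wise into one input matrix."""
    n_sites = {len(row) for pop in (donor, recipient, outgroup) for row in pop}
    if len(n_sites) != 1:
        raise ValueError("all rows must cover the same sites")
    return donor + recipient + outgroup


# Toy data: 1 marks the derived allele at a site (assumed coding).
donor = [[1, 1, 0, 1], [1, 0, 0, 1]]
recipient = [[1, 1, 0, 0], [0, 0, 1, 0], [1, 1, 0, 1], [0, 0, 0, 0]]
outgroup = [[0, 0, 1, 0], [0, 0, 0, 0]]

phased_input = stack_populations(donor, recipient, outgroup)
unphased_input = stack_populations(unphase(donor), unphase(recipient), unphase(outgroup))
```

In the unphased variant each row holds allele counts per individual, so the matrix has half as many rows and no information about which haplotype carries which allele.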


Detecting adaptive introgression in human evolution using convolutional neural networks

10.1101/2020.09.18.301069 ◽

2020 ◽

Cited By ~ 2

Author(s):

Graham Gower ◽

Pablo Iáñez Picazo ◽

Matteo Fumagalli ◽

Fernando Racimo

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Genomic Sequence ◽

Sequence Data ◽

Simulated Data ◽

Alternative Methods ◽

Adaptive Introgression ◽

Donor Population ◽

Human Genomic ◽

Related Population


Genome Functional Annotation across Species using Deep Convolutional Neural Networks

10.1101/330308 ◽

2018 ◽

Cited By ~ 1

Author(s):

Ghazaleh Khodabandelou ◽

Etienne Routhier ◽

Julien Mozziconacci

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Genomic Sequence ◽

Sequence Data ◽

Transcription Start ◽

Deep Convolutional Neural Networks ◽

Transcription Start Sites ◽

Genome Wide ◽

Associated Proteins ◽

A New Technique

The application of deep neural networks is growing rapidly across many disciplines. In genomics, the development of deep neural networks is expected to revolutionize current practice. Several approaches relying on convolutional neural networks have been developed to associate short genomic sequences with a functional role, such as promoters, enhancers or protein binding sites along genomes. These approaches rely on generating batches of sequences with known annotations for training. While they perform well when predicting annotations on a test subset of these batches, they usually perform poorly when applied genome-wide. In this study, we address this issue and propose an optimal strategy to train convolutional neural networks for this specific application. We use transcription start sites as a case study and show that a model trained on one organism can be used to predict transcription start sites in a different species. This cross-species application of convolutional neural networks trained with genomic sequence data provides a new technique to annotate any genome from previously existing annotations in related species. It also provides a way to determine whether the sequence patterns recognized by chromatin-associated proteins in different species are conserved.
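To make the gap between batch evaluation and genome-wide application more concrete, here is a minimal sketch of how a trained classifier might be scanned along a genome: the sequence is cut into fixed-size windows and each window is one-hot encoded before scoring. One-hot encoding, the A/C/G/T channel order, and the names `one_hot` and `sliding_windows` are assumptions for illustration; the abstract does not state the encoding or window settings used.

```python
from typing import Iterator, List, Tuple

BASES = "ACGT"  # assumed channel order


def one_hot(seq: str) -> List[List[int]]:
    """Encode each base as a 4-element indicator vector; unknown bases become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def sliding_windows(genome: str, size: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, window) pairs covering the genome with a fixed window size and stride."""
    for start in range(0, len(genome) - size + 1, step):
        yield start, genome[start:start + size]


genome = "ACGTNACG"
windows = list(sliding_windows(genome, size=4, step=2))
encoded = [one_hot(window) for _, window in windows]
```

Scanning this way produces far more negative windows than a balanced training batch would contain, which is one plausible reason a model that looks accurate on held-out batches can perform poorly genome-wide.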


The use of Convolutional Neural Networks for signal-background classification in Particle Physics experiments

EPJ Web of Conferences ◽

10.1051/epjconf/202024506003 ◽

2020 ◽

Vol 245 ◽

pp. 06003

Author(s):

Venkitesh Ayyar ◽

Wahid Bhimji ◽

Lisa Gerhardt ◽

Sally Robertson ◽

Zahra Ronaghi

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Particle Physics ◽

Simulated Data ◽

Image Data ◽

Use Case ◽

Neural Architecture ◽

Ice Cube ◽

2D And 3D ◽

Physics Experiments

The success of Convolutional Neural Networks (CNNs) in image classification has prompted efforts to study their use for classifying image data obtained in Particle Physics experiments. Here, we discuss our efforts to apply CNNs to 2D and 3D image data from particle physics experiments to separate signal from background. In this work we present an extensive convolutional neural architecture search, achieving high accuracy for signal/background discrimination for a HEP classification use-case based on simulated data from the IceCube neutrino observatory and an ATLAS-like detector. We demonstrate, among other things, that CNNs with fewer parameters can match the accuracy of complex ResNet architectures, and we present comparisons of computational requirements, training and inference times.


Representation learning of genomic sequence motifs with convolutional neural networks

PLoS Computational Biology ◽

10.1371/journal.pcbi.1007560 ◽

2019 ◽

Vol 15 (12) ◽

pp. e1007560 ◽

Cited By ~ 9

Author(s):

Peter K. Koo ◽

Sean R. Eddy

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Genomic Sequence ◽

Representation Learning ◽

Sequence Motifs


The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference

10.1101/336073 ◽

2018 ◽

Cited By ~ 3

Author(s):

Lex Flagel ◽

Yaniv Brandvain ◽

Daniel R. Schrider

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Population Genetic ◽

Sequence Data ◽

Input Sequence ◽

Evolutionary Model ◽

Sequence Alignments ◽

Likelihood Approach ◽

Population Genetic Inference ◽

Genetic Inference

Population-scale genomic datasets have given researchers incredible amounts of information from which to infer evolutionary histories. Concomitant with this flood of data, theoretical and methodological advances have sought to extract information from genomic sequences to infer demographic events such as population size changes and gene flow among closely related populations/species, construct recombination maps, and uncover loci underlying recent adaptation. To date, most methods make use of only one or a few summaries of the input sequences and therefore ignore potentially useful information encoded in the data. The most sophisticated of these approaches involve likelihood calculations, which require theoretical advances for each new problem, and often focus on a single aspect of the data (e.g. only allele frequency information) in the interest of mathematical and computational tractability. Directly interrogating the entirety of the input sequence data in a likelihood-free manner would thus offer a fruitful alternative. Here we accomplish this by representing DNA sequence alignments as images and using a class of deep learning methods called convolutional neural networks (CNNs) to make population genetic inferences from these images. We apply CNNs to a number of evolutionary questions and find that they frequently match or exceed the accuracy of current methods. Importantly, we show that CNNs perform accurate evolutionary model selection and parameter estimation, even on problems that have not received detailed theoretical treatments. Thus, when applied to population genetic alignments, CNNs are capable of outperforming expert-derived statistical methods, and offer a new path forward in cases where no likelihood approach exists.
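The idea of treating an alignment as an image can be sketched as follows: keep only the segregating sites, code each allele as 0 (matches the ancestral base) or 1 (differs from it), and pad every row to a fixed width so that all inputs share one shape. The 0/1 coding, the zero padding, and the function name `alignment_to_image` are assumptions for this example rather than the encoding used in the paper.

```python
from typing import List


def alignment_to_image(seqs: List[str], ancestral: str, width: int) -> List[List[int]]:
    """Encode an alignment as a 0/1 matrix over segregating sites, zero-padded to `width`."""
    if any(len(s) != len(ancestral) for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # A site is segregating when the samples carry more than one distinct base.
    segregating = [j for j in range(len(ancestral)) if len({s[j] for s in seqs}) > 1]
    image = []
    for s in seqs:
        row = [int(s[j] != ancestral[j]) for j in segregating][:width]
        row += [0] * (width - len(row))  # pad short rows to the fixed width
        image.append(row)
    return image


seqs = ["ACGTA", "ACGTT", "ATGTA", "ATGTT"]
image = alignment_to_image(seqs, ancestral="ACGTA", width=4)
```

Each row of `image` is one sampled chromosome and each column one variable site, so a convolutional filter sliding over the matrix sees local patterns of shared variation rather than a single summary statistic.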


Faculty Opinions recommendation of Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.737976542.793577284 ◽

2020 ◽

Author(s):

Erich Bornberg-Bauer ◽

Daniel Dowling

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Genomic Sequence ◽

mRNA Abundance ◽

Deep Convolutional Neural Networks


Faculty Opinions recommendation of Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.737976542.793586931 ◽

2021 ◽

Author(s):

Roderic Guigo ◽

Manuel Muñoz Aguirre

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Genomic Sequence ◽

mRNA Abundance ◽

Deep Convolutional Neural Networks


Fast processing of environmental DNA metabarcoding sequence data using convolutional neural networks

10.1101/2021.05.22.445213 ◽

2021 ◽

Author(s):

Benjamin Flück ◽

Laëtitia Mathon ◽

Stéphanie Manel ◽

Alice Valentini ◽

Tony Dejean ◽

...

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Environmental Changes ◽

Sequence Data ◽

Mitochondrial Gene ◽

Fish Fauna ◽

Environmental DNA ◽

Small Region ◽

Anthropogenic Pressures ◽

Speed And Accuracy

The intensification of anthropogenic pressures has increasing consequences for biodiversity and ultimately for the functioning of ecosystems. To monitor and better understand biodiversity responses to environmental changes using standardized and reproducible methods, novel high-throughput DNA sequencing is becoming a major tool. Indeed, organisms shed DNA traces in their environment, and this "environmental DNA" (eDNA) can be collected and sequenced using eDNA metabarcoding. The processing of large volumes of eDNA metabarcoding data remains challenging, especially its transformation into relevant taxonomic lists that can be interpreted by experts. Speed and accuracy are two major bottlenecks in this critical step. Here, we investigate whether convolutional neural networks (CNNs) can optimize the processing of short eDNA sequences. We tested whether the speed and accuracy of a CNN are comparable to those of the frequently used OBITools bioinformatic pipeline. We applied the methodology to a massive eDNA dataset collected in tropical South America (French Guiana), where freshwater fishes were targeted using a small region (60 bp) of the 12S ribosomal RNA mitochondrial gene. We found that the taxonomic assignments from the CNN were comparable to those of OBITools, with high correlation levels and a similar match to the regional fish fauna. The CNN processed raw fastq files at a rate of approximately 1 million sequences per minute, which was ~150 times faster than OBITools. Once trained, the application of a CNN to new eDNA metabarcoding data can be automated, which promises fast and easy deployment on the cloud for future eDNA analyses.


Background Rejection using Convolutional Neural Networks

Proceedings of the International Astronomical Union ◽

10.1017/s1743921318000492 ◽

2017 ◽

Vol 13 (S338) ◽

pp. 37-39

Author(s):

Adam Zadrożny ◽

Beata Goźlińska

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Early Stage ◽

Simulated Data ◽

Wide Field ◽

Proof Of Concept ◽

Field Surveys ◽

Background Rejection ◽

Very High

The paper presents a proof-of-concept method of background rejection based on convolutional neural networks (CNNs). The method was tested on simulated data and achieved very high accuracy (100%). Moreover, the CNN-based method is very fast and could easily be applied to wide-field surveys. Since early-stage results suggest the method is very accurate and robust, it could be helpful in creating very low-latency pipelines for EM follow-up, which will be needed in LIGO-Virgo O3 EM follow-up.


Classifying exoplanet candidates with convolutional neural networks: application to the Next Generation Transit Survey

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stz2058 ◽

2019 ◽

Vol 488 (4) ◽

pp. 5232-5250 ◽

Cited By ~ 2

Author(s):

Alexander Chaushev ◽

Liam Raynard ◽

Michael R Goad ◽

Philipp Eigmüller ◽

David J Armstrong ◽

...

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Network Performance ◽

Area Under The Curve ◽

Simulated Data ◽

Real Data ◽

Training Data ◽

Next Generation ◽

Data Set ◽

Time Required

Vetting of exoplanet candidates in transit surveys is a manual process, which suffers from a large number of false positives and a lack of consistency. Previous work has shown that convolutional neural networks (CNN) provide an efficient solution to these problems. Here, we apply a CNN to classify planet candidates from the Next Generation Transit Survey (NGTS). For training data sets we compare both real data with injected planetary transits and fully simulated data, as well as how their different compositions affect network performance. We show that fewer hand labelled light curves can be utilized, while still achieving competitive results. With our best model, we achieve an area under the curve (AUC) score of (95.6 ± 0.2) per cent and an accuracy of (88.5 ± 0.3) per cent on our unseen test data, as well as (76.5 ± 0.4) per cent and (74.6 ± 1.1) per cent in comparison to our existing manual classifications. The neural network recovers 13 out of 14 confirmed planets observed by NGTS, with high probability. We use simulated data to show that the overall network performance is resilient to mislabelling of the training data set, a problem that might arise due to unidentified, low signal-to-noise transits. Using a CNN, the time required for vetting can be reduced by half, while still recovering the vast majority of manually flagged candidates. In addition, we identify many new candidates with high probabilities which were not flagged by human vetters.
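Because the abstract reports both an AUC and an accuracy, a small worked example may help show why the two can differ. The sketch below computes AUC as the fraction of (positive, negative) pairs that the classifier ranks correctly, and accuracy at a fixed threshold. The scores, labels, and the 0.5 threshold are invented for illustration and are unrelated to the NGTS results.

```python
from typing import List


def auc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve as the probability that a positive outranks a negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Ties between a positive and a negative score count as half a correct ranking.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels: List[int], scores: List[float], threshold: float = 0.5) -> float:
    """Fraction of examples classified correctly at a fixed score threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


labels = [1, 1, 1, 0, 0, 0]          # 1 = planet, 0 = false positive (toy data)
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc_value = auc(labels, scores)      # 8 of 9 pairs ranked correctly
acc_value = accuracy(labels, scores) # 4 of 6 correct at threshold 0.5
```

AUC depends only on how candidates are ranked, while accuracy also depends on where the threshold is placed, which is why the two figures need not agree.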
