Fast processing of environmental DNA metabarcoding sequence data using convolutional neural networks

2021 ◽  
Author(s):  
Benjamin Flück ◽  
Laëtitia Mathon ◽  
Stéphanie Manel ◽  
Alice Valentini ◽  
Tony Dejean ◽  
...  

The intensification of anthropogenic pressures has increasing consequences for biodiversity and ultimately for the functioning of ecosystems. To monitor and better understand biodiversity responses to environmental changes using standardized and reproducible methods, novel high-throughput DNA sequencing is becoming a major tool. Indeed, organisms shed DNA traces in their environment and this "environmental DNA" (eDNA) can be collected and sequenced using eDNA metabarcoding. The processing of large volumes of eDNA metabarcoding data remains challenging, especially its transformation into relevant taxonomic lists that can be interpreted by experts. Speed and accuracy are two major bottlenecks in this critical step. Here, we investigate whether convolutional neural networks (CNNs) can optimize the processing of short eDNA sequences. We tested whether the speed and accuracy of a CNN are comparable to those of the frequently used OBITools bioinformatic pipeline. We applied the methodology to a massive eDNA dataset collected in Tropical South America (French Guiana), where freshwater fishes were targeted using a small region (60 bp) of the 12S ribosomal RNA mitochondrial gene. We found that the taxonomic assignments from the CNN were comparable to those of OBITools, with high correlation levels and a similar match to the regional fish fauna. The CNN allowed the processing of raw fastq files at a rate of approximately 1 million sequences per minute, which was ~150 times faster than with OBITools. Once trained, the application of a CNN to new eDNA metabarcoding data can be automated, which promises fast and easy deployment on the cloud for future eDNA analyses.
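A CNN cannot read nucleotide letters directly, so short reads are usually turned into numeric matrices first. The sketch below is illustrative only and is not taken from the authors' code: it assumes a one-hot encoding with the column order A, C, G, T, pads reads to a fixed length, and applies a single hand-made convolution filter. The names `one_hot` and `conv1d` are hypothetical.

```python
# Illustrative sketch; the encoding and helper names are assumptions,
# not the pipeline described in the abstract.
BASES = "ACGT"


def one_hot(seq, length=60):
    """Encode a DNA read as a (length x 4) matrix.

    Reads longer than `length` are truncated, shorter reads are padded.
    Ambiguous bases (e.g. N) and padding become all-zero rows.
    """
    rows = [[1.0 if base == b else 0.0 for b in BASES]
            for base in seq.upper()[:length]]
    rows.extend([[0.0] * 4 for _ in range(length - len(rows))])
    return rows


def conv1d(matrix, kernel):
    """Slide a (width x 4) kernel along the read (stride 1, no padding)."""
    width = len(kernel)
    out = []
    for start in range(len(matrix) - width + 1):
        total = 0.0
        for i in range(width):
            for j in range(4):
                total += matrix[start + i][j] * kernel[i][j]
        out.append(total)
    return out


read = "ACGTNACG"
x = one_hot(read, length=10)        # 10 rows, 4 columns
kernel = one_hot("ACG", length=3)   # filter that responds to the motif ACG
activations = conv1d(x, kernel)     # peaks (value 3.0) at offsets 0 and 5
```

In a trained network the filter weights are learned rather than hand-written, and many filters run in parallel, but the sliding dot product is the same operation.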

2018 ◽  
Author(s):  
Lex Flagel ◽  
Yaniv Brandvain ◽  
Daniel R. Schrider

Population-scale genomic datasets have given researchers incredible amounts of information from which to infer evolutionary histories. Concomitant with this flood of data, theoretical and methodological advances have sought to extract information from genomic sequences to infer demographic events such as population size changes and gene flow among closely related populations/species, construct recombination maps, and uncover loci underlying recent adaptation. To date, most methods make use of only one or a few summaries of the input sequences and therefore ignore potentially useful information encoded in the data. The most sophisticated of these approaches involve likelihood calculations, which require theoretical advances for each new problem, and often focus on a single aspect of the data (e.g. only allele frequency information) in the interest of mathematical and computational tractability. Directly interrogating the entirety of the input sequence data in a likelihood-free manner would thus offer a fruitful alternative. Here we accomplish this by representing DNA sequence alignments as images and using a class of deep learning methods called convolutional neural networks (CNNs) to make population genetic inferences from these images. We apply CNNs to a number of evolutionary questions and find that they frequently match or exceed the accuracy of current methods. Importantly, we show that CNNs perform accurate evolutionary model selection and parameter estimation, even on problems that have not received detailed theoretical treatments. Thus, when applied to population genetic alignments, CNNs are capable of outperforming expert-derived statistical methods, and offer a new path forward in cases where no likelihood approach exists.
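One simple way to picture "an alignment as an image" is a binary matrix with one row per sequence and one column per variable site. The sketch below is a minimal illustration under that assumption; the abstract does not specify the encoding, and the function name `alignment_to_image` is hypothetical.

```python
from collections import Counter


def alignment_to_image(alignment):
    """Turn equal-length aligned sequences into a 0/1 matrix.

    Rows are sequences and columns are segregating (variable) sites.
    0 marks the most common allele in a column, 1 marks any other allele.
    Monomorphic columns are dropped.
    """
    n_sites = len(alignment[0])
    if any(len(seq) != n_sites for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    columns = []
    for site in range(n_sites):
        alleles = [seq[site] for seq in alignment]
        counts = Counter(alleles)
        if len(counts) < 2:
            continue  # no variation at this site
        major = counts.most_common(1)[0][0]
        columns.append([0 if allele == major else 1 for allele in alleles])

    # Transpose so that each row corresponds to one input sequence.
    return [list(row) for row in zip(*columns)]


alignment = ["ACGT", "ACGA", "TCGT", "ACGT"]
image = alignment_to_image(alignment)
# image == [[0, 0], [0, 1], [1, 0], [0, 0]]
```

The resulting matrix can be fed to a 2D or 1D convolutional layer like any single-channel image; row order and padding to a fixed width are further design choices not covered here.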


2018 ◽  
Author(s):  
Ghazaleh Khodabandelou ◽  
Etienne Routhier ◽  
Julien Mozziconacci

The application of deep neural networks is today a rapidly growing field in many disciplines. In genomics, the development of deep neural networks is expected to revolutionize current practice. Several approaches relying on convolutional neural networks have been developed to associate short genomic sequences with a functional role such as promoters, enhancers or protein binding sites along genomes. These approaches rely on the generation of sequence batches with known annotations for training. While they perform well when predicting annotations from a test subset of these batches, they usually perform poorly when applied genome-wide. In this study, we address this issue and propose an optimal strategy to train convolutional neural networks for this specific application. We use transcription start sites as a case study and show that a model trained on one organism can be used to predict transcription start sites in a different species. This cross-species application of convolutional neural networks trained with genomic sequence data provides a new technique to annotate any genome from previously existing annotations in related species. It also provides a way to determine whether the sequence patterns recognized by chromatin-associated proteins in different species are conserved or not.


Author(s):  
Graham Gower ◽  
Pablo Iáñez Picazo ◽  
Matteo Fumagalli ◽  
Fernando Racimo

Studies in a variety of species have shown evidence for positively selected variants introduced into one population via introgression from another, distantly related population - a process known as adaptive introgression. However, there are few explicit frameworks for jointly modelling introgression and positive selection, in order to detect these variants using genomic sequence data. Here, we develop an approach based on convolutional neural networks (CNNs). CNNs do not require the specification of an analytical model of allele frequency dynamics, and have outperformed alternative methods for classification and parameter estimation tasks in various areas of population genetics. Thus, they are potentially well suited to the identification of adaptive introgression. Using simulations, we trained CNNs on genotype matrices derived from genomes sampled from the donor population, the recipient population and a related non-introgressed population, in order to distinguish regions of the genome evolving under adaptive introgression from those evolving neutrally or experiencing selective sweeps. Our CNN architecture exhibits 95% accuracy on simulated data, even when the genomes are unphased, and accuracy decreases only moderately in the presence of heterosis. As a proof of concept, we applied our trained CNNs to human genomic datasets - both phased and unphased - to detect candidates for adaptive introgression that shaped our evolutionary history.


2017 ◽  
Vol 2017 ◽  
pp. 1-8 ◽  
Author(s):  
Jing Wang ◽  
Cheng Ling ◽  
Jingyang Gao

Many structural variation (SV) detection methods have been proposed due to the popularization of next-generation sequencing (NGS). These SV calling methods use different SV-property-dependent features; however, they all suffer from poor accuracy when run on low-coverage sequences. The union of results from these tools achieves fairly high sensitivity but still produces low accuracy on low-coverage sequence data. That is, the combined results contain many false positives. In this paper, we present CNNdel, an approach for calling deletions from paired-end reads. CNNdel gathers SV candidates reported by multiple tools and then extracts features from aligned BAM files at the positions of candidates. With labeled feature-expressed candidates as a training set, CNNdel trains convolutional neural networks (CNNs) to distinguish true unlabeled candidates from false ones. Results show that CNNdel works well with NGS reads from 26 low-coverage genomes of the 1000 Genomes Project. The paper demonstrates that convolutional neural networks can automatically prioritize SV features and effectively reduce false positives.
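The candidate-gathering step can be pictured as taking the union of deletion calls from several tools and merging calls that overlap. The sketch below is an assumption about how such a union might be built, not CNNdel's actual procedure: the overlap rule, the tuple layout and the tool names `toolA` and `toolB` are all hypothetical.

```python
def merge_candidates(calls_by_tool):
    """Merge overlapping deletion calls from several tools.

    calls_by_tool maps a tool name to a list of (chrom, start, end) calls.
    Returns a list of (chrom, start, end, tools) tuples, where `tools` is
    the sorted list of tools that reported a call inside the merged interval.
    """
    flat = sorted(
        (chrom, start, end, tool)
        for tool, calls in calls_by_tool.items()
        for chrom, start, end in calls
    )

    merged = []
    for chrom, start, end, tool in flat:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Overlaps the previous interval: extend it and record the tool.
            last[2] = max(last[2], end)
            last[3].add(tool)
        else:
            merged.append([chrom, start, end, {tool}])

    return [(c, s, e, sorted(tools)) for c, s, e, tools in merged]


calls = {
    "toolA": [("chr1", 100, 200), ("chr2", 50, 80)],
    "toolB": [("chr1", 150, 260)],
}
candidates = merge_candidates(calls)
# candidates == [("chr1", 100, 260, ["toolA", "toolB"]),
#                ("chr2", 50, 80, ["toolA"])]
```

Each merged candidate would then be the position at which features are extracted from the aligned reads before classification.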


eLife ◽  
2021 ◽  
Vol 10 ◽  
Author(s):  
Graham Gower ◽  
Pablo Iáñez Picazo ◽  
Matteo Fumagalli ◽  
Fernando Racimo

Studies in a variety of species have shown evidence for positively selected variants introduced into a population via introgression from another, distantly related population - a process known as adaptive introgression. However, there are few explicit frameworks for jointly modelling introgression and positive selection, in order to detect these variants using genomic sequence data. Here, we develop an approach based on convolutional neural networks (CNNs). CNNs do not require the specification of an analytical model of allele frequency dynamics, and have outperformed alternative methods for classification and parameter estimation tasks in various areas of population genetics. Thus, they are potentially well suited to the identification of adaptive introgression. Using simulations, we trained CNNs on genotype matrices derived from genomes sampled from the donor population, the recipient population and a related non-introgressed population, in order to distinguish regions of the genome evolving under adaptive introgression from those evolving neutrally or experiencing selective sweeps. Our CNN architecture exhibits 95% accuracy on simulated data, even when the genomes are unphased, and accuracy decreases only moderately in the presence of heterosis. As a proof of concept, we applied our trained CNNs to human genomic datasets - both phased and unphased - to detect candidates for adaptive introgression that shaped our evolutionary history.
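To make the input format concrete, the following sketch builds one genotype matrix from three populations. It is a simplified illustration under stated assumptions, not the authors' implementation: haplotypes are 0/1 rows over the same sites, unphased genotypes are formed by summing consecutive haplotype pairs, and the three populations are concatenated row-wise. The function names are hypothetical.

```python
def unphased_genotypes(haplotypes):
    """Collapse consecutive haplotype pairs (0/1 rows) into genotype rows.

    Each genotype entry is 0, 1 or 2: the number of derived alleles an
    individual carries at that site, with phase information discarded.
    """
    if len(haplotypes) % 2 != 0:
        raise ValueError("expected an even number of haplotypes")
    return [
        [a + b for a, b in zip(first, second)]
        for first, second in zip(haplotypes[0::2], haplotypes[1::2])
    ]


def stack_populations(donor, recipient, outgroup):
    """Concatenate per-population matrices row-wise over shared sites."""
    widths = {len(row) for pop in (donor, recipient, outgroup) for row in pop}
    if len(widths) != 1:
        raise ValueError("all populations must cover the same sites")
    return donor + recipient + outgroup


donor = unphased_genotypes([[0, 1, 0], [1, 1, 0]])      # [[1, 2, 0]]
recipient = unphased_genotypes([[0, 1, 0], [0, 0, 0]])  # [[0, 1, 0]]
outgroup = unphased_genotypes([[0, 0, 0], [0, 0, 1]])   # [[0, 0, 1]]
matrix = stack_populations(donor, recipient, outgroup)
# matrix == [[1, 2, 0], [0, 1, 0], [0, 0, 1]]
```

Keeping the populations in a fixed row order lets a convolutional filter compare allele sharing between donor and recipient rows against the non-introgressed rows.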


2021 ◽  
Vol 4 ◽  
Author(s):  
Petra Nowak ◽  
Christina Wiebe ◽  
Rolf Karez ◽  
Hendrik Schubert

The use of environmental DNA (eDNA) for qualitative species inventories offers great potential as a cost-effective tool for species identification. This requires that the target species release DNA, that reference information is available and that detection methods exist. Environmental DNA analyses are currently used routinely to inventory fish fauna (Wang et al. 2021), molluscs (Klymus et al. 2017) or insects (Uchida et al. 2020). For other groups, such as macrophytes, there is not much information available (Scriver et al. 2015). In plants, identifying suitable eDNA markers has been much more challenging, as no single DNA region has been accepted for the purposes of barcoding. Within this project, we assessed whether stoneworts (Charophytes, Characeae) can be detected using eDNA analysis and whether this approach can be used to support macrophyte monitoring. Charophytes are macroscopic green algae which, because of their role as habitat engineers, are of special importance for aquatic ecosystems. Many charophyte species are bound to clean, nutrient-poor fresh and brackish waters (e.g. Melzer 1999) and are regarded as bioindicators for water quality by national and international directives (e.g. Habitats Directive, EU Water Framework Directive). Because charophytes are sensitive to anthropogenic pressures, a drastic decline in populations with increasing eutrophication has been reported (Sand-Jensen et al. 2017). However, the diversity of Characeae is often underestimated due to difficulties in morphological determination, and the genetic identification of charophytes has been established only in the last few years (e.g. Nowak et al. 2016). We assessed the potential utility of eDNA to document the diversity of charophyte species.
eDNA from a freshwater lake (Dreetzsee, Germany, 2018) and from a brackish water site (Darß-Zingst Lagoon System, Germany, 2018) was extracted from filtered or ethanol-precipitated water samples, and we designed and tested eDNA markers based on four regions of the chloroplast genome - atpB, rbcL, psbC, and matK. Of the four regions, matK and rbcL were most likely to amplify DNA from charophyte species. Both sites exhibit a diverse charophyte flora, which we could successfully identify to species/group level by eDNA analysis. In a current study, the developed eDNA markers are used to scrutinize the charophyte population of the Schlei estuary (Germany, Schleswig-Holstein). Since conventional monitoring can only be carried out once a year at a few sites, Characeae have not been observed in recent years, or only very sporadically. As it is not possible to survey the entire Schlei, especially due to high water turbidity, the eDNA methodology is being tested to assess the presence of Characeae species.


2017 ◽  
Author(s):  
Stefan Budach ◽  
Annalisa Marsico

Summary: Convolutional neural networks (CNNs) have been shown to perform exceptionally well in a variety of tasks, including biological sequence classification. Available implementations, however, are usually optimized for a particular task and difficult to reuse. To enable researchers to utilize these networks more easily, we implemented pysster, a Python package for training CNNs on biological sequence data. Sequences are classified by learning sequence and structure motifs, and the package offers an automated hyper-parameter optimization procedure and options to visualize learned motifs along with information about their positional and class enrichment. The package runs seamlessly on CPU and GPU and provides a simple interface to train and evaluate a network with a handful of lines of code. Using an RNA A-to-I editing data set and CLIP-seq binding site sequences, we demonstrate that pysster classifies sequences with higher accuracy than other methods and is able to recover known sequence and structure motifs. Availability: pysster is freely available at https://github.com/budach/[email protected], [email protected]

