DeepTE: a computational method for de novo classification of transposons with convolutional neural network

Haidong Yan; Aureliano Bombarely; Song Li

doi:10.1093/bioinformatics/btaa519

Bioinformatics ◽

10.1093/bioinformatics/btaa519 ◽

2020 ◽

Vol 36 (15) ◽

pp. 4269-4275 ◽

Cited By ~ 3

Author(s):

Haidong Yan ◽

Aureliano Bombarely ◽

Song Li

Keyword(s):

De Novo ◽

Genomic Sequence ◽

Computational Method ◽

Supplementary Information ◽

Supplementary Data ◽

Model Species ◽

Essential Step ◽

Genomic Sequence Analysis ◽

Eukaryotic Genomes

Abstract Motivation Transposable elements (TEs) classification is an essential step to decode their roles in genome evolution. With a large number of genomes from non-model species becoming available, accurate and efficient TE classification has emerged as a new challenge in genomic sequence analysis. Results We developed a novel tool, DeepTE, which classifies unknown TEs using convolutional neural networks (CNNs). DeepTE transferred sequences into input vectors based on k-mer counts. A tree structured classification process was used where eight models were trained to classify TEs into super families and orders. DeepTE also detected domains inside TEs to correct false classification. An additional model was trained to distinguish between non-TEs and TEs in plants. Given unclassified TEs of different species, DeepTE can classify TEs into seven orders, which include 15, 24 and 16 super families in plants, metazoans and fungi, respectively. In several benchmarking tests, DeepTE outperformed other existing tools for TE classification. In conclusion, DeepTE successfully leverages CNN for TE classification, and can be used to precisely classify TEs in newly sequenced eukaryotic genomes. Availability and implementation DeepTE is accessible at https://github.com/LiLabAtVT/DeepTE. Supplementary information Supplementary data are available at Bioinformatics online.
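
As a rough illustration of the k-mer count encoding mentioned in the abstract above, the sketch below turns a DNA sequence into a fixed-length count vector. The function name `kmer_count_vector`, the choice of k = 3, and the decision to skip windows containing non-ACGT characters are assumptions made for illustration; the abstract does not state DeepTE's actual k or preprocessing, and this is not the tool's implementation.

```python
from itertools import product


def kmer_count_vector(seq, k=3):
    """Return counts of every possible DNA k-mer, in lexicographic order."""
    alphabet = "ACGT"
    # Fixed ordering of all 4**k k-mers so every sequence maps to the same layout.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Hypothetical choice: ignore windows with ambiguous bases such as N.
        if kmer in index:
            counts[index[kmer]] += 1
    return counts


vec = kmer_count_vector("ACGTACGTN", k=3)
print(len(vec), sum(vec))
```

Because the vector length depends only on k (4**k entries), sequences of different lengths map to inputs of the same shape, which is what a fixed-size network input would require.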

DeepTE: a computational method for de novo classification of transposons with convolutional neural network

10.1101/2020.01.27.921874 ◽

2020 ◽

Author(s):

Haidong Yan ◽

Aureliano Bombarely ◽

Song Li

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

De Novo ◽

Genomic Sequence ◽

Computational Method ◽

Model Species ◽

Essential Step ◽

Genomic Sequence Analysis ◽

Eukaryotic Genomes

Abstract Motivation Transposable elements (TEs) classification is an essential step to decode their roles in genome evolution. With a large number of genomes from non-model species becoming available, accurate and efficient TE classification has emerged as a new challenge in genomic sequence analysis. Results We developed a novel tool, DeepTE, which classifies unknown TEs using convolutional neural networks. DeepTE transferred sequences into input vectors based on k-mer counts. A tree structured classification process was used where eight models were trained to classify TEs into super families and orders. DeepTE also detected domains inside TEs to correct false classification. An additional model was trained to distinguish between non-TEs and TEs in plants. Given unclassified TEs of different species, DeepTE can classify TEs into seven orders, which include 15, 24, and 16 super families in plants, metazoans, and fungi, respectively. In several benchmarking tests, DeepTE outperformed other existing tools for TE classification. In conclusion, DeepTE successfully leverages convolutional neural network for TE classification, and can be used to precisely identify and annotate TEs in newly sequenced eukaryotic genomes. Availability DeepTE is accessible at https://github.com/LiLabAtVT/[email protected]

Machine learning approaches to predict the Plant-associated phenotype of Xanthomonas strains

BMC Genomics ◽

10.1186/s12864-021-08093-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Dennie te Molder ◽

Wasin Poncheewin ◽

Peter J. Schaap ◽

Jasper J. Koehorst

Keyword(s):

Machine Learning ◽

Plant Pathogens ◽

De Novo ◽

Classification Algorithms ◽

Learning Approaches ◽

Enabling Factors ◽

Essential Step ◽

The World ◽

Genome Content

Abstract Background The genus Xanthomonas has long been considered to consist predominantly of plant pathogens, but over the last decade there has been an increasing number of reports on non-pathogenic and endophytic members. As Xanthomonas species are prevalent pathogens on a wide variety of important crops around the world, there is a need to distinguish between these plant-associated phenotypes. To date a large number of Xanthomonas genomes have been sequenced, which enables the application of machine learning (ML) approaches on the genome content to predict this phenotype. Until now such approaches to the pathogenomics of Xanthomonas strains have been hampered by the fragmentation of information regarding pathogenicity of individual strains over many studies. Unification of this information into a single resource was therefore considered to be an essential step. Results Mining of 39 papers considering both plant-associated phenotypes, allowed for a phenotypic classification of 578 Xanthomonas strains. For 65 plant-pathogenic and 53 non-pathogenic strains the corresponding genomes were available and de novo annotated for the presence of Pfam protein domains used as features to train and compare three ML classification algorithms; CART, Lasso and Random Forest. Conclusion The literature resource in combination with recursive feature extraction used in the ML classification algorithms provided further insights into the virulence enabling factors, but also highlighted domains linked to traits not present in pathogenic strains.

ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions

Bioinformatics ◽

10.1093/bioinformatics/btz431 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4754-4756 ◽

Cited By ~ 29

Author(s):

Egor Dolzhenko ◽

Viraj Deshpande ◽

Felix Schlesinger ◽

Peter Krusche ◽

Roman Petrovski ◽

...

Keyword(s):

Tandem Repeat ◽

Broad Class ◽

Source Code ◽

Computational Method ◽

Supplementary Information ◽

Dna Repeats ◽

Supplementary Data ◽

Sequence Graph ◽

Version 2.0 ◽

Short Tandem

Abstract Summary We describe a novel computational method for genotyping repeats using sequence graphs. This method addresses the long-standing need to accurately genotype medically important loci containing repeats adjacent to other variants or imperfect DNA repeats such as polyalanine repeats. Here we introduce a new version of our repeat genotyping software, ExpansionHunter, that uses this method to perform targeted genotyping of a broad class of such loci. Availability and implementation ExpansionHunter is implemented in C++ and is available under the Apache License Version 2.0. The source code, documentation, and Linux/macOS binaries are available at https://github.com/Illumina/ExpansionHunter/. Supplementary information Supplementary data are available at Bioinformatics online.

Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers

Bioinformatics ◽

10.1093/bioinformatics/btaa915 ◽

2020 ◽

Author(s):

Yuansheng Liu ◽

Xiaocai Zhang ◽

Quan Zou ◽

Xiangxiang Zeng

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

Supplementary Data ◽

Complementary Strand ◽

Short Reads ◽

Sequencing Technologies ◽

Computational Resources

Abstract Summary Removing duplicate and near-duplicate reads, generated by high-throughput sequencing technologies, is able to reduce computational resources in downstream applications. Here we develop minirmd, a de novo tool to remove duplicate reads via multiple rounds of clustering using different length of minimizer. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on reverse-complementary strand. Availability and implementation https://github.com/yuansliu/minirmd. Supplementary information Supplementary data are available at Bioinformatics online.
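
To make the minimizer idea in the abstract above concrete, the following sketch groups reads by a strand-independent minimizer, so that a read and its reverse complement land in the same bucket. Everything here is an assumption for illustration: the function names, the use of the lexicographically smallest k-mer as the minimizer, and the single round with k = 4. It is not minirmd's actual algorithm, which the abstract describes only as multiple rounds of clustering with minimizers of different lengths.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (other characters are left unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def minimizer(read, k):
    """Lexicographically smallest k-mer over the read and its reverse complement."""
    candidates = []
    for strand in (read, reverse_complement(read)):
        candidates.extend(strand[i:i + k] for i in range(len(strand) - k + 1))
    return min(candidates)


def bucket_by_minimizer(reads, k):
    """Group reads that share a minimizer; candidates for duplicate checks."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


# The second read is the reverse complement of the first.
reads = ["ACGTTGCA", "TGCAACGT", "GGGGCCCC"]
buckets = bucket_by_minimizer(reads, k=4)
print(buckets)
```

Reads in the same bucket are only candidates; a real duplicate-removal tool would still compare them directly before discarding any.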

TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

Bioinformatics ◽

10.1093/bioinformatics/btaa440 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i75-i83 ◽

Cited By ~ 5

Author(s):

Alla Mikheenko ◽

Andrey V Bzikadze ◽

Alexey Gurevich ◽

Karen H Miga ◽

Pavel A Pevzner

Keyword(s):

Quality Assessment ◽

Chromosome Segregation ◽

Tandem Repeats ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Cellular Processes ◽

Long Reads ◽

Long Read ◽

Eukaryotic Genomes

Abstract Motivation Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there are no tools for their quality assessment. Moreover, since the mapping of error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. Results To address these problems, we developed the TandemTools software that includes the TandemMapper tool for mapping reads to ETRs and the TandemQUAST tool for polishing ETR assemblies and their quality assessment. We demonstrate that TandemTools not only reveals errors in ETR assemblies but also improves the recently generated assemblies of human centromeres. Availability and implementation https://github.com/ablab/TandemTools. Supplementary information Supplementary data are available at Bioinformatics online.

HiLight-PTM: an online application to aid matching peptide pairs with isotopically labelled PTMs

Bioinformatics ◽

10.1093/bioinformatics/btz654 ◽

2019 ◽

Author(s):

Harry J Whitwell ◽

Peter DiMaggio

Keyword(s):

De Novo ◽

De Novo Sequencing ◽

Mass Shift ◽

Supplementary Information ◽

Database Searching ◽

Supplementary Data ◽

Exact Match ◽

High Confidence ◽

Online Application ◽

Internet Browser

Abstract Motivation Database searching of isotopically labelled PTMs can be problematic and we frequently find that only one, or neither in a heavy/light pair are assigned. In such cases, having a pair of MS/MS spectra that differ due to an isotopic label can assist in identifying the relevant m/z values that support the correct peptide annotation or can be used for de novo sequencing. Results We have developed an online application that identifies matching peaks and peaks differing by the appropriate mass shift (difference between heavy and light PTM) between two MS/MS spectra. Furthermore, the application predicts, from the exact-match peaks, the mass of their complementary ions and highlights these as high confidence matches between the two spectra. The result is a tool to visually compare two spectra, and downloadable peaks lists that can be used to support de novo sequencing. Availability and implementation HiLight-PTM is released using shinyapps.io by RStudio, and can be accessed from any internet browser at https://harrywhitwell.shinyapps.io/hilight-ptm/. Supplementary information Supplementary data are available at Bioinformatics online.
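
The peak comparison described in the abstract above can be sketched as follows: given two peak lists, collect pairs that match exactly and pairs that differ by the expected heavy/light mass shift. The function name `match_peaks`, the 0.01 tolerance, the example m/z values, and the 3.019 shift are all hypothetical, and the sketch assumes singly charged fragments so that the m/z difference equals the mass shift. It is not the HiLight-PTM implementation.

```python
def match_peaks(light, heavy, mass_shift, tol=0.01):
    """Pair peaks that agree exactly or differ by mass_shift, within tol."""
    exact, shifted = [], []
    for mz_l in light:
        for mz_h in heavy:
            delta = mz_h - mz_l
            if abs(delta) <= tol:
                # Fragment does not carry the label: same m/z in both spectra.
                exact.append((mz_l, mz_h))
            elif abs(delta - mass_shift) <= tol:
                # Fragment carries the label: heavy peak is offset by the shift.
                shifted.append((mz_l, mz_h))
    return exact, shifted


# Hypothetical peak lists (m/z values only, intensities omitted).
light = [175.119, 304.161, 417.245]
heavy = [175.119, 307.180, 420.264]
exact, shifted = match_peaks(light, heavy, mass_shift=3.019)
print(exact)
print(shifted)
```

The nested loop is quadratic in the number of peaks, which is acceptable for a pair of MS/MS spectra; sorted lists with a two-pointer scan would be the usual refinement for larger inputs.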

FOCUS2: agile and sensitive classification of metagenomics data using a reduced database

10.1101/046425 ◽

2016 ◽

Cited By ~ 2

Author(s):

Genivaldo Gueiros Z. Silva ◽

Bas E. Dutilh ◽

Robert A. Edwards

Keyword(s):

Microbial Community ◽

Dna Sequences ◽

Computational Method ◽

Environmental Research ◽

Supplementary Information ◽

Sequence Classification ◽

Computationally Efficient ◽

Link Type ◽

Metagenomics Data

Abstract Summary Metagenomics approaches rely on identifying the presence of organisms in the microbial community from a set of unknown DNA sequences. Sequence classification has valuable applications in multiple important areas of medical and environmental research. Here we introduce FOCUS2, an update of the previously published computational method FOCUS. FOCUS2 was tested with 10 simulated and 543 real metagenomes demonstrating that the program is more sensitive, faster, and more computationally efficient than existing methods. Availability The Python implementation is freely available at https://edwards.sdsu.edu/FOCUS2. Supplementary information available at Bioinformatics online.

RepeatModeler2 for automated genomic discovery of transposable element families

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1921046117 ◽

2020 ◽

Vol 117 (17) ◽

pp. 9451-9457 ◽

Cited By ~ 26

Author(s):

Jullien M. Flynn ◽

Robert Hubley ◽

Clément Goubert ◽

Jeb Rosen ◽

Andrew G. Clark ◽

...

Keyword(s):

De Novo ◽

Fruit Fly ◽

Automated Identification ◽

Sequence Coverage ◽

Model Species ◽

Consensus Sequences ◽

Sequence Complexity ◽

Link Type ◽

Eukaryotic Genomes ◽

Ltr Retroelements

The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).

mixtureS: a novel tool for bacterial strain genome reconstruction from reads

Bioinformatics ◽

10.1093/bioinformatics/btaa728 ◽

2020 ◽

Author(s):

Xin Li ◽

Haiyan Hu ◽

Xiaoman Li

Keyword(s):

Environmental Samples ◽

De Novo ◽

Source Code ◽

Supplementary Information ◽

Supplementary Data ◽

Bacterial Strains ◽

Metagenomic Sample ◽

Almost All ◽

User Friendly ◽

Strain Genome

Abstract Motivation It is essential to study bacterial strains in environmental samples. Existing methods and tools often depend on known strains or known variations, cannot work on individual samples, not reliable, or not easy to use, etc. It is thus important to develop more user-friendly tools that can identify bacterial strains more accurately. Results We developed a new tool called mixtureS that can de novo identify bacterial strains from shotgun reads of a clonal or metagenomic sample, without prior knowledge about the strains and their variations. Tested on 243 simulated datasets and 195 experimental datasets, mixtureS reliably identified the strains, their numbers and their abundance. Compared with three tools, mixtureS showed better performance in almost all simulated datasets and the vast majority of experimental datasets. Availability and implementation The source code and tool mixtureS is available at http://www.cs.ucf.edu/~xiaoman/mixtureS/. Supplementary information Supplementary data are available at Bioinformatics online.

SimkaMin: fast and resource frugal de novo comparative metagenomics

Bioinformatics ◽

10.1093/bioinformatics/btz685 ◽

2019 ◽

Author(s):

Gaëtan Benoit ◽

Mahendra Mariadassou ◽

Stéphane Robin ◽

Sophie Schbath ◽

Pierre Peterlongo ◽

...

Keyword(s):

Large Scale ◽

De Novo ◽

Supplementary Information ◽

Metagenomic Data ◽

Supplementary Data ◽

Comparative Metagenomics ◽

Large Sets ◽

Efficient Data ◽

Genomic Similarity

Abstract Motivation De novo comparative metagenomics is one of the most straightforward ways to analyze large sets of metagenomic data. Latest methods use the fraction of shared k-mers to estimate genomic similarity between read sets. However, those methods, while extremely efficient, are still limited by computational needs for practical usage outside of large computing facilities. Results We present SimkaMin, a quick comparative metagenomics tool with low disk and memory footprints, thanks to an efficient data subsampling scheme used to estimate Bray-Curtis and Jaccard dissimilarities. One billion metagenomic reads can be analyzed in <3 min, with tiny memory (1.09 GB) and disk (≈0.3 GB) requirements and without altering the quality of the downstream comparative analyses, making of SimkaMin a tool perfectly tailored for very large-scale metagenomic projects. Availability and implementation https://github.com/GATB/simka. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
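
For readers unfamiliar with the two dissimilarities named in the abstract above, the sketch below computes them from k-mer counts of two read sets. It uses the standard definitions (Jaccard on k-mer presence, Bray-Curtis on k-mer abundance) and computes them exactly over all k-mers; it does not reproduce SimkaMin's subsampling scheme. The function names, the toy reads, and k = 3 are assumptions for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def jaccard_dissimilarity(a, b):
    """1 - |shared distinct k-mers| / |all distinct k-mers|."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return 1.0 - len(set(a) & set(b)) / len(union)


def bray_curtis_dissimilarity(a, b):
    """1 - 2 * sum(min(count_a, count_b)) / (total_a + total_b)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[x], b[x]) for x in set(a) & set(b))
    return 1.0 - 2.0 * shared / total


sample_a = kmer_counts(["ACGTAC"], k=3)
sample_b = kmer_counts(["ACGTTT"], k=3)
print(jaccard_dissimilarity(sample_a, sample_b))
print(bray_curtis_dissimilarity(sample_a, sample_b))
```

In this toy case the two samples share 2 of 6 distinct k-mers, and 2 of the 4 k-mer occurrences in each sample are matched, so the two measures differ even on the same data.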
