SwiftOrtho: a Fast, Memory-Efficient, Multiple Genome Orthology Classifier

bioRxiv ◽

10.1101/543223 ◽

2019 ◽

Author(s):

Xiao Hu ◽

Iddo Friedberg

Keyword(s):

Protein Function ◽

Large Scale ◽

Comparative Genomic ◽

Analysis Tool ◽

Bacterial Genomes ◽

Function Annotation ◽

Large Scale Data ◽

Protein Function Annotation ◽

Genome Analyses ◽

Memory Efficient

Abstract. Introduction: Gene homology type classification is a requisite for many types of genome analyses, including comparative genomics, phylogenetics, and protein function annotation. A large variety of tools have been developed to perform homology classification across genomes of different species. However, when applied to large genomic datasets, these tools require high memory and CPU usage, typically available only in costly computational clusters. To address this problem, we developed a new graph-based orthology analysis tool, SwiftOrtho, which is optimized for speed and memory usage when applied to large-scale data. Results: In our tests, SwiftOrtho is the only tool that completed orthology analysis of 1,760 bacterial genomes on a computer with only 4 GB RAM. Using various standard orthology datasets, we also show that SwiftOrtho has high accuracy. SwiftOrtho enables accurate comparative genomic analyses of thousands of genomes using low-memory computers. Availability: https://github.com/Rinoahu/SwiftOrtho

SwiftOrtho: A fast, memory-efficient, multiple genome orthology classifier

GigaScience ◽

10.1093/gigascience/giz118 ◽

2019 ◽

Vol 8 (10) ◽

Cited By ~ 7

Author(s):

Xiao Hu ◽

Iddo Friedberg

Keyword(s):

Protein Function ◽

Large Scale ◽

Homology Search ◽

Comparative Genomic ◽

Data Sets ◽

Analysis Tool ◽

Memory Usage ◽

Spaced Seeds ◽

Speed Up ◽

Genome Analyses

Abstract. Background: Gene homology type classification is required for many types of genome analyses, including comparative genomics, phylogenetics, and protein function annotation. Consequently, a large variety of tools have been developed to perform homology classification across genomes of different species. However, when applied to large genomic data sets, these tools require high memory and CPU usage, typically available only in computational clusters. Findings: Here we present a new graph-based orthology analysis tool, SwiftOrtho, which is optimized for speed and memory usage when applied to large-scale data. SwiftOrtho uses long k-mers to speed up homology search, while using a reduced amino acid alphabet and spaced seeds to compensate for the loss of sensitivity due to long k-mers. In addition, it uses an affinity propagation algorithm to reduce the memory usage when clustering large-scale orthology relationships into orthologous groups. In our tests, SwiftOrtho was the only tool that completed orthology analysis of proteins from 1,760 bacterial genomes on a computer with only 4 GB RAM. Using various standard orthology data sets, we also show that SwiftOrtho has a high accuracy. Conclusions: SwiftOrtho enables the accurate comparative genomic analyses of thousands of genomes using low-memory computers. SwiftOrtho is available at https://github.com/Rinoahu/SwiftOrtho
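The seeding idea mentioned in the abstract above (a reduced amino acid alphabet combined with spaced seeds) can be illustrated with a minimal Python sketch. This is not SwiftOrtho's implementation: the residue grouping in GROUPS, the seed pattern, and all function names are hypothetical choices made only for illustration. The sketch maps each residue to a group representative, extracts seeds that ignore the '0' positions of a pattern, and counts shared seeds between a query and indexed sequences.

```python
from collections import defaultdict

# Illustrative grouping of the 20 amino acids (an assumption, not SwiftOrtho's alphabet).
GROUPS = ["AGST", "C", "DE", "NQ", "HKR", "ILMV", "FWY", "P"]
# Each residue maps to the first letter of its group.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(seq):
    """Map every residue to its group representative; unknown residues become 'X'."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def spaced_seeds(seq, pattern="1101011"):
    """Yield (start, seed) pairs; only positions marked '1' in the pattern are kept."""
    reduced = reduce_sequence(seq)
    keep = [i for i, flag in enumerate(pattern) if flag == "1"]
    for start in range(len(reduced) - len(pattern) + 1):
        yield start, "".join(reduced[start + i] for i in keep)


def build_index(sequences, pattern="1101011"):
    """Build a seed -> set of sequence names lookup table."""
    index = defaultdict(set)
    for name, seq in sequences.items():
        for _, seed in spaced_seeds(seq, pattern):
            index[seed].add(name)
    return index


def candidate_hits(query, index, pattern="1101011"):
    """Count how many distinct query seeds each indexed sequence shares."""
    counts = defaultdict(int)
    for seed in {s for _, s in spaced_seeds(query, pattern)}:
        for name in index.get(seed, ()):
            counts[name] += 1
    return dict(counts)
```

In this sketch, two sequences that differ only by substitutions within a group (for example, K versus R) produce identical seeds, and mismatches at '0' positions are tolerated. That is the general sense in which such seeds can recover sensitivity lost to long exact k-mers; a real tool would follow seed hits with alignment and scoring.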

Dynamics of Genome Architecture in Rhizobium sp. Strain NGR234

Journal of Bacteriology ◽

10.1128/jb.184.1.171-176.2002 ◽

2002 ◽

Vol 184 (1) ◽

pp. 171-176 ◽

Cited By ~ 62

Author(s):

Patrick Mavingui ◽

Margarita Flores ◽

Xianwu Guo ◽

Guillermo Dávila ◽

Xavier Perret ◽

...

Keyword(s):

Large Scale ◽

Insertion Sequence ◽

Biological Significance ◽

Genome Architecture ◽

Bacterial Genomes ◽

Symbiotic Plasmid ◽

Sequence Elements ◽

Dynamic Structures ◽

Genome Analyses ◽

Insertion Sequence Elements

Abstract. Bacterial genomes are usually partitioned into several replicons, which are dynamic structures prone to mutation and genomic rearrangements, thus contributing to genome evolution. Nevertheless, much remains to be learned about the origins and dynamics of the formation of bacterial alternative genomic states and their possible biological consequences. To address these issues, we have studied the dynamics of the genome architecture in Rhizobium sp. strain NGR234 and analyzed its biological significance. The NGR234 genome consists of three replicons: the symbiotic plasmid pNGR234a (536,165 bp), the megaplasmid pNGR234b (>2,000 kb), and the chromosome (>3,700 kb). Here we report that genome analyses of cell siblings showed the occurrence of large-scale DNA rearrangements consisting of cointegrations and excisions between the three replicons. As a result, four new genomic architectures have emerged. Three consisted of cointegrates between two replicons: chromosome-pNGR234a, chromosome-pNGR234b, and pNGR234a-pNGR234b. The fourth consisted of a cointegrate of all three replicons (chromosome-pNGR234a-pNGR234b). Cointegration and excision of pNGR234a with either the chromosome or pNGR234b were studied and found to proceed via a Campbell-type mechanism, mediated by insertion sequence elements. We provide evidence showing that changes in the genome architecture did not alter the growth and symbiotic proficiency of Rhizobium derivatives.

SkewIT: The Skew Index Test for large-scale GC Skew analysis of bacterial genomes

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008439 ◽

2020 ◽

Vol 16 (12) ◽

pp. e1008439

Author(s):

Jennifer Lu ◽

Steven L. Salzberg

Keyword(s):

Large Scale ◽

Analysis Tool ◽

Index Test ◽

Bacterial Genomes ◽

Phylogenetic Groups ◽

Bacterial Phyla ◽

Link Type ◽

Gc Skew ◽

A Genome ◽

Web App

GC skew is a phenomenon observed in many bacterial genomes, wherein the two replication strands of the same chromosome contain different proportions of guanine and cytosine nucleotides. Here we demonstrate that this phenomenon, which was first discovered in the mid-1990s, can be used today as an analysis tool for the 15,000+ complete bacterial genomes in NCBI’s RefSeq library. In order to analyze all 15,000+ genomes, we introduce a new method, SkewIT (Skew Index Test), that calculates a single metric representing the degree of GC skew for a genome. Using this metric, we demonstrate how GC skew patterns are conserved within certain bacterial phyla, e.g., Firmicutes, but show different patterns in other phylogenetic groups such as Actinobacteria. We also discovered that outlier values of SkewIT highlight potential bacterial mis-assemblies. Using our newly defined metric, we identify multiple mis-assembled chromosomal sequences in previously published complete bacterial genomes. We provide a SkewIT web app (https://jenniferlu717.shinyapps.io/SkewIT/) that calculates SkewI for any user-provided bacterial sequence. The web app also provides an interactive interface for the data generated in this paper, allowing users to further investigate the SkewI values and thresholds of the RefSeq-97 complete bacterial genomes. Individual scripts for analysis of bacterial genomes are provided in the following repository: https://github.com/jenniferlu717/SkewIT.
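For readers unfamiliar with GC skew, the quantity is conventionally computed per window as (G - C) / (G + C). The Python sketch below shows only this classic windowed calculation and its running sum; it is not the SkewI metric, whose definition is not given in the abstract above. The function names and the non-overlapping window scheme are assumptions made for illustration.

```python
def gc_skew(window):
    """Return (G - C) / (G + C) for one uppercase window, or 0.0 if it has no G or C."""
    g = window.count("G")
    c = window.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq, window_size):
    """Compute GC skew over consecutive non-overlapping windows of the sequence."""
    seq = seq.upper()
    last_start = len(seq) - window_size
    return [gc_skew(seq[i:i + window_size])
            for i in range(0, last_start + 1, window_size)]


def cumulative(values):
    """Return the running sum of the per-window skew values."""
    total = 0.0
    out = []
    for value in values:
        total += value
        out.append(total)
    return out
```

In many bacterial chromosomes, the sign of the windowed skew changes near the replication origin and terminus, so the cumulative curve tends to show one clear minimum and one clear maximum. A skew profile that departs strongly from that pattern is the kind of signal the abstract associates with possible mis-assemblies.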

A combined approach for genome wide protein function annotation/prediction

Proteome Science ◽

10.1186/1477-5956-11-s1-s1 ◽

2013 ◽

Vol 11 (Suppl 1) ◽

pp. S1 ◽

Cited By ~ 17

Author(s):

Alfredo Benso ◽

Stefano Di Carlo ◽

Hafeez ur Rehman ◽

Gianfranco Politano ◽

Alessandro Savino ◽

...

Keyword(s):

Protein Function ◽

Combined Approach ◽

Function Annotation ◽

Protein Function Annotation ◽

Genome Wide

Protein function annotation using protein domain family resources

Methods ◽

10.1016/j.ymeth.2015.09.029 ◽

2016 ◽

Vol 93 ◽

pp. 24-34 ◽

Cited By ~ 17

Author(s):

Sayoni Das ◽

Christine A. Orengo

Keyword(s):

Protein Function ◽

Protein Domain ◽

Domain Family ◽

Family Resources ◽

Function Annotation ◽

Protein Function Annotation ◽

Protein Domain Family

The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications

BMC Genomics ◽

10.1186/1471-2164-9-s2-s2 ◽

2008 ◽

Vol 9 (Suppl 2) ◽

pp. S2 ◽

Cited By ~ 24

Author(s):

Inbal Halperin ◽

Dariya S Glazer ◽

Shirley Wu ◽

Russ B Altman

Keyword(s):

Protein Function ◽

Function Annotation ◽

Protein Function Annotation ◽

Novel Applications

The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation

BMC Bioinformatics ◽

10.1186/1471-2105-9-52 ◽

2008 ◽

Vol 9 (1) ◽

pp. 52 ◽

Cited By ~ 23

Author(s):

Chenggang Yu ◽

Nela Zavaljevski ◽

Valmik Desai ◽

Seth Johnson ◽

Fred J Stevens ◽

...

Keyword(s):

Protein Function ◽

Function Annotation ◽

Protein Function Annotation ◽

Genome Wide ◽

Automated Pipeline

Improving Protein Function Annotation via Unsupervised Pre-training: Robustness, Efficiency, and Insights

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining ◽

10.1145/3447548.3467163 ◽

2021 ◽

Author(s):

David Dohan ◽

Andreea Gane ◽

Maxwell L. Bileschi ◽

David Belanger ◽

Lucy Colwell

Keyword(s):

Protein Function ◽

Function Annotation ◽

Protein Function Annotation

Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning

Briefings in Bioinformatics ◽

10.1093/bib/bbz081 ◽

2019 ◽

Vol 21 (4) ◽

pp. 1437-1447 ◽

Cited By ~ 19

Author(s):

Jiajun Hong ◽

Yongchao Luo ◽

Yang Zhang ◽

Junbiao Ying ◽

Weiwei Xue ◽

...

Keyword(s):

Deep Learning ◽

False Discovery Rate ◽

Protein Function ◽

Functional Annotation ◽

De Novo ◽

Learning Algorithm ◽

Function Annotation ◽

False Discovery ◽

Protein Function Annotation ◽

Annotation Accuracy

Abstract. Functional annotation of protein sequences with high accuracy has become one of the most important issues in modern biomedical studies, and computational approaches that significantly accelerate the analysis and enhance its accuracy are greatly desired. Although a variety of methods have been developed to improve protein annotation accuracy, their ability to control false annotation rates remains either limited or not systematically evaluated. In this study, a protein encoding strategy, together with a deep learning algorithm, was proposed to control the false discovery rate in protein function annotation, and its performance was systematically compared with that of traditional similarity-based and de novo approaches. Based on a comprehensive assessment from multiple perspectives, the proposed strategy and algorithm were found to perform better in both prediction stability and annotation accuracy than other de novo methods. Moreover, an in-depth assessment revealed that it had an improved capacity to control the false discovery rate compared with traditional methods. Overall, this study not only provided a comprehensive analysis of the performance of the newly proposed strategy but also provided a tool for researchers in the field of protein function annotation.

Functional classification of CATH superfamilies: a domain-based approach for protein function annotation

Bioinformatics ◽

10.1093/bioinformatics/btv398 ◽

2015 ◽

Vol 31 (21) ◽

pp. 3460-3467 ◽

Cited By ~ 43

Author(s):

Sayoni Das ◽

David Lee ◽

Ian Sillitoe ◽

Natalie L. Dawson ◽

Jonathan G. Lees ◽

...

Keyword(s):

Protein Function ◽

Functional Classification ◽

Function Annotation ◽

Protein Function Annotation