Protein length distribution is remarkably consistent across Life

Abstract In every living species, the function of a protein depends on its organisation of structural domains, and the length of a protein is a direct reflection of this. Because every species evolved under different evolutionary pressures, the protein length distribution, much like other genomic features, is expected to vary across species. Here we evaluated this diversity by comparing protein length distribution across 2,326 species (1,688 bacteria, 153 archaea and 485 eukaryotes). We found that proteins tend to be on average slightly longer in eukaryotes than in bacteria or archaea, but that the variation of length distribution across species is low, especially compared to the variation of other genomic features (genome size, number of proteins, gene length, GC content, isoelectric points of proteins). Moreover, most cases of atypical protein length distribution appear to be due to artifactual gene annotation, suggesting the actual variation of protein length distribution across species is even smaller. These results open the way for developing a genome annotation quality metric based on protein length distribution to complement conventional quality measures. Overall, our findings show that protein length distribution between living species is more consistent than previously thought, and provide evidence for a universal purifying selection on protein length, whose mechanism and fitness effect remain intriguing open questions.
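As a rough illustration of the kind of comparison the abstract describes, the sketch below summarises the protein length distribution of one proteome supplied as FASTA text. It is not the authors' pipeline: the function names `protein_lengths` and `summarize` are hypothetical, and the choice of mean and median as summary statistics is an assumption made for the example.

```python
from statistics import mean, median


def protein_lengths(fasta_text):
    """Return the length of each sequence in a FASTA-formatted string."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Trailing '*' marks a stop codon in some exports; do not count it.
            current += len(line.rstrip("*"))
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Summarise a list of protein lengths (assumed non-empty)."""
    return {"n": len(lengths), "mean": mean(lengths), "median": median(lengths)}


if __name__ == "__main__":
    example = ">a\nMKV\nLL\n>b\nMA*\n>c\nMKKKK\n"
    print(summarize(protein_lengths(example)))
```

Running the same summary over many proteomes and comparing the resulting statistics would be one simple way to look for the atypical distributions that the abstract attributes to annotation artifacts.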

A modified GC-specific MAKER gene annotation method reveals improved and novel gene predictions of high and low GC content in Oryza sativa

10.1101/115345 ◽

2017 ◽

Author(s):

Megan J. Bowman ◽

Jane A. Pulman ◽

Tiffany L. Liu ◽

Kevin L. Childs

Keyword(s):

Oryza Sativa ◽

Gene Annotation ◽

Gene Prediction ◽

Biological Significance ◽

Gc Content ◽

Training Data ◽

Structural Annotation ◽

Gene Variation ◽

A Genome ◽

Grass Genomes

Abstract Accurate structural annotation depends on well-trained gene prediction programs. Training data for gene prediction programs are often chosen randomly from a subset of high-quality genes that ideally represent the variation found within a genome. One aspect of gene variation is GC content, which differs across species and is bimodal in grass genomes. We find that gene prediction programs trained on genes with random GC content do not completely predict all grass genes with extreme GC content. We present a new GC-specific MAKER annotation protocol to predict new and improved gene models and assess the biological significance of this method in Oryza sativa.

Novel metrics for quantifying bacterial genome composition skews

10.1101/176370 ◽

2017 ◽

Author(s):

Lena M. Joesch-Cohen ◽

Max Robinson ◽

Neda Jabbari ◽

Christopher Lausted ◽

Gustavo Glusman

Keyword(s):

Gene Annotation ◽

Bacterial Species ◽

Bacterial Genome ◽

Gc Content ◽

Bacterial Genomes ◽

Genome Composition ◽

Single Genome ◽

A Genome ◽

Dna Strands ◽

Interactive Visualizations

Abstract Background Bacterial genomes have characteristic compositional skews, which are differences in nucleotide frequency between the leading and lagging DNA strands across a segment of a genome. It is thought that these strand asymmetries arise as a result of mutational biases and selective constraints, particularly for energy efficiency. Analysis of compositional skews in a diverse set of bacteria provides a comparative context in which mutational and selective environmental constraints can be studied. These analyses typically require finished and well-annotated genomic sequences. Results We present three novel metrics for examining genome composition skews; all three metrics can be computed for unfinished or partially-annotated genomes. The first two metrics (dot-skew and cross-skew) depend on sequence and gene annotation of a single genome, while the third metric (residual skew) highlights unusual genomes by subtracting a GC content-based model of a library of genome sequences. We applied these metrics to all 7738 available bacterial genomes, including partial drafts, and identified outlier species. A number of these outliers (i.e., Borrelia, Ehrlichia, Kinetoplastibacterium, and Phytoplasma) display similar skew patterns despite only a distant phylogenetic relationship. While unrelated, some of the outlier bacterial species share lifestyle characteristics, in particular intracellularity and biosynthetic dependence on their hosts. Conclusions Our novel metrics appear to reflect the effects of biosynthetic constraints and adaptations to life within one or more hosts on genome composition. We provide results for each analyzed genome, software and interactive visualizations at http://db.systemsbiology.net/gestalt/skew_metrics.
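For readers unfamiliar with compositional skew, the sketch below computes the conventional GC skew, (G - C) / (G + C), over sliding windows of a sequence. This is the standard textbook quantity, not the dot-skew, cross-skew or residual skew metrics introduced in the paper, whose definitions are not given in the abstract. The function names and the window and step parameters are assumptions chosen for the example.

```python
def gc_skew(seq):
    """Conventional GC skew of a sequence: (G - C) / (G + C)."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    if g + c == 0:
        return 0.0  # no G or C in the sequence, so the skew is undefined; report 0
    return (g - c) / (g + c)


def windowed_gc_skew(seq, window, step):
    """GC skew for each full window of length `window`, advancing by `step`."""
    return [
        gc_skew(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: a G-rich half followed by a C-rich half.
    print(windowed_gc_skew("GGGGACCCCA", window=5, step=5))
```

In many bacterial chromosomes the sign of the windowed skew changes near the origin and terminus of replication, which is why such profiles are commonly plotted along the genome.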

Chromosome-level genome assembly and transcriptome-based annotation of the oleaginous yeast Rhodotorula toruloides CBS 14

10.1101/2021.04.09.439123 ◽

2021 ◽

Author(s):

Giselle De La Caridad Martin Hernandez ◽

Bettina Muller ◽

Mikolaj Chmielarz ◽

Christian Brandt ◽

Martin Hoelzer ◽

...

Keyword(s):

Oleaginous Yeast ◽

Gene Annotation ◽

Gc Content ◽

Lipid Synthesis ◽

Growth Conditions ◽

Specific Gene ◽

Molecular Physiology ◽

Total Size ◽

A Genome ◽

Genome Draft

Rhodotorula toruloides is an oleaginous yeast with high biotechnological potential. In order to understand the molecular physiology of lipid synthesis in R. toruloides and to advance metabolic engineering, a high-resolution genome is required. We constructed a genome draft of R. toruloides CBS 14, using a hybrid assembly approach, consisting of short and long reads generated by Illumina and Nanopore sequencing, respectively. The genome draft consists of 23 contigs and 3 scaffolds, with an N50 length of 1,529,952 bp, thus largely representing chromosomal organization. The total size is 20,534,857 bp with a GC content of 61.83%. Transcriptomic data from different growth conditions were used to aid species-specific gene annotation. In total we annotated 9,464 genes and identified 11,691 transcripts. Furthermore, we demonstrated the presence of a potential plasmid, an extrachromosomal circular structure of about 11 kb with a copy number about three times as high as the other chromosomes.
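The assembly statistics quoted above (N50 and GC content) follow standard definitions, which the sketch below illustrates on toy data. N50 is taken here as the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function names are hypothetical and the toy inputs are not from the R. toruloides assembly.

```python
def n50(lengths):
    """N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def gc_content(seq):
    """Fraction of G and C among the A, C, G, T bases of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        return 0.0  # nothing but ambiguous bases; report 0 rather than divide by zero
    return gc / acgt


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
    print(gc_content("GCAT"))
```

Ambiguous bases such as N are excluded from the GC denominator in this sketch; other tools may count them, which slightly changes the reported percentage.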

Comparative Genomics of Xanthomonas euroxanthea and Xanthomonas arboricola pv. juglandis Strains Isolated from a Single Walnut Host Tree

Microorganisms ◽

10.3390/microorganisms9030624 ◽

2021 ◽

Vol 9 (3) ◽

pp. 624

Author(s):

Camila Fernandes ◽

Leonor Martins ◽

Miguel Teixeira ◽

Jochen Blom ◽

Joël F. Pothier ◽

...

Keyword(s):

Extracellular Enzymes ◽

Walnut Tree ◽

Size Number ◽

A Genome ◽

Type 3 Secretion System ◽

Xanthomonas Arboricola ◽

Chromosomal Sequences ◽

Related Proteins ◽

Type 3 ◽

Genome Comparisons

The recent report of distinct Xanthomonas lineages of Xanthomonas arboricola pv. juglandis and Xanthomonas euroxanthea within the same walnut tree revealed that this consortium of walnut-associated Xanthomonas includes both pathogenic and nonpathogenic strains. As the implications of this co-colonization are still poorly understood, in order to unveil niche-specific adaptations, the genomes of three X. euroxanthea strains (CPBF 367, CPBF 424T, and CPBF 426) and of an X. arboricola pv. juglandis strain (CPBF 427) isolated from a single walnut tree in Loures (Portugal) were sequenced with two different technologies, Illumina and Nanopore, to provide consistent single scaffold chromosomal sequences. General genomic features showed that CPBF 427 has a genome similar to other X. arboricola pv. juglandis strains, regarding its size, number, and content of CDSs, while X. euroxanthea strains show a reduction in these features compared with X. arboricola pv. juglandis strains. Whole genome comparisons revealed remarkable genomic differences between X. arboricola pv. juglandis and X. euroxanthea strains, which translate into different pathogenicity and virulence features, namely regarding type 3 secretion system and its effectors and other secretory systems, chemotaxis-related proteins, and extracellular enzymes. Altogether, the distinct genomic repertoire of X. euroxanthea may be particularly useful to address pathogenicity emergence and evolution in walnut-associated Xanthomonas.

The genome sequence of the European peacock butterfly, Aglais io (Linnaeus, 1758)

Wellcome Open Research ◽

10.12688/wellcomeopenres.17204.1 ◽

2021 ◽

Vol 6 ◽

pp. 258

Author(s):

Konrad Lohse ◽

Alexander Mackintosh ◽

Roger Vila ◽

...

Keyword(s):

Genome Sequence ◽

Genome Assembly ◽

Sex Chromosome ◽

Gene Annotation ◽

Protein Coding ◽

Individual Male ◽

Protein Coding Genes ◽

A Genome ◽

Inachis Io

We present a genome assembly from an individual male Aglais io (also known as Inachis io and Nymphalis io) (the European peacock; Arthropoda; Insecta; Lepidoptera; Nymphalidae). The genome sequence is 384 megabases in span. The majority (99.91%) of the assembly is scaffolded into 31 chromosomal pseudomolecules, with the Z sex chromosome assembled. Gene annotation of this assembly on Ensembl has identified 11,420 protein coding genes.

Local genic base composition impacts protein production and cellular fitness

PeerJ ◽

10.7717/peerj.4286 ◽

2018 ◽

Vol 6 ◽

pp. e4286 ◽

Cited By ~ 2

Author(s):

Erik M. Quandt ◽

Charles C. Traverse ◽

Howard Ochman

Keyword(s):

Escherichia Coli ◽

Base Composition ◽

Protein Production ◽

Purifying Selection ◽

Specific Protein ◽

Compositional Variation ◽

Terminal Portion ◽

A Genome ◽

Selection For ◽

Protein Feature

The maintenance of a G + C content that is higher than the mutational input to a genome provides support for the view that selection serves to increase G + C contents in bacteria. Recent experimental evidence from Escherichia coli demonstrated that selection for increasing G + C content operates at the level of translation, but the precise mechanism by which this occurs is unknown. To determine the substrate of selection, we asked whether selection on G + C content acts across all sites within a gene or is confined to particular genic regions or nucleotide positions. We systematically altered the G + C contents of the GFP gene and assayed its effects on the fitness of strains harboring each variant. Fitness differences were attributable to the base compositional variation in the terminal portion of the gene, suggesting a connection to the folding of a specific protein feature. Variants containing sequence features that are thought to result in rapid translation, such as low G + C content and high levels of codon adaptation, displayed highly reduced growth rates. Taken together, our results show that purifying selection acting against A and T mutations most likely results from their tendency to increase the rate of translation, which can perturb the dynamics of protein folding.

Comparative Genomics Reveals the Core Gene Toolbox for the Fungus-Insect Symbiosis

mBio ◽

10.1128/mbio.00636-18 ◽

2018 ◽

Vol 9 (3) ◽

Cited By ~ 5

Author(s):

Yan Wang ◽

Matt Stata ◽

Wei Wang ◽

Jason E. Stajich ◽

Merlin M. White ◽

...

Keyword(s):

Entomopathogenic Fungi ◽

Gc Content ◽

Core Gene ◽

Whole Genome ◽

Black Flies ◽

Genome Sequences ◽

Free Living ◽

Genomic Features ◽

Insect Pathogens ◽

Insect Gut

ABSTRACT Modern genomics has shed light on many entomopathogenic fungi and expanded our knowledge widely; however, little is known about the genomic features of the insect-commensal fungi. Harpellales are obligate commensals living in the digestive tracts of disease-bearing insects (black flies, midges, and mosquitoes). In this study, we produced and annotated whole-genome sequences of nine Harpellales taxa and conducted the first comparative analyses to infer the genomic diversity within the members of the Harpellales. The genomes of the insect gut fungi feature low (26% to 37%) GC content and large genome size variations (25 to 102 Mb). Further comparisons with insect-pathogenic fungi (from both Ascomycota and Zoopagomycota), as well as with free-living relatives (as negative controls), helped to identify a gene toolbox that is essential to the fungus-insect symbiosis. The results not only narrow the genomic scope of fungus-insect interactions from several thousands to eight core players but also distinguish host invasion strategies employed by insect pathogens and commensals. The genomic content suggests that insect commensal fungi rely mostly on adhesion protein anchors that target the digestive system, while entomopathogenic fungi have higher numbers of transmembrane helices, signal peptides, and pathogen-host interaction (PHI) genes across the whole genome and enrich genes as well as functional domains to inactivate the host inflammation system and suppress the host defense. Phylogenomic analyses have revealed that genome sizes of Harpellales fungi vary among lineages with an integer-multiple pattern, which implies that ancient genome duplications may have occurred within the gut of insects. IMPORTANCE Insect guts harbor various microbes that are important for host digestion, immune response, and disease dispersal in certain cases. Bacteria, which are among the primary endosymbionts, have been studied extensively. However, fungi, which are also frequently encountered, are poorly known with respect to their biology within the insect guts. To understand the genomic features and related biology, we produced the whole-genome sequences of nine gut commensal fungi from disease-bearing insects (black flies, midges, and mosquitoes). The results show that insect gut fungi tend to have low GC content across their genomes. By comparing these commensals with entomopathogenic and free-living fungi that have available genome sequences, we found a universal core gene toolbox that is unique and thus potentially important for the insect-fungus symbiosis. This comparative work also uncovered different host invasion strategies employed by insect pathogens and commensals, as well as a model system to study ancient fungal genome duplication within the gut of insects.

Draft Genome Sequence of Lactobacillus rhamnosus OSU-PECh-69, a Cheese Isolate with Antibacterial Activity

Microbiology Resource Announcements ◽

10.1128/mra.00803-20 ◽

2020 ◽

Vol 9 (37) ◽

Author(s):

Israel García-Cano ◽

Walaa E. Hussein ◽

Diana Rocha-Mendoza ◽

Ahmed E. Yousef ◽

Rafael Jiménez-Flores

Keyword(s):

Genome Sequence ◽

Antimicrobial Agents ◽

Draft Genome ◽

Lactobacillus Rhamnosus ◽

Gc Content ◽

Gene Clusters ◽

Gram Negative Bacteria ◽

The Novel ◽

Content Type ◽

A Genome

ABSTRACT The novel strain Lactobacillus rhamnosus OSU-PECh-69 was isolated from provolone cheese. It produces antimicrobial agents having a molecular mass of 5 to 10 kDa that are active against Gram-positive and Gram-negative bacteria. The strain has a genome sequence of 3,057,669 bp, a GC content of 46.6%, and up to two gene clusters encoding bacteriocins.

Genome expansion by allopolyploidization in the fungal strain Coniochaeta 2T2.1 and its exceptional lignocellulolytic machinery

Biotechnology for Biofuels ◽

10.1186/s13068-019-1569-6 ◽

2019 ◽

Vol 12 (1) ◽

Cited By ~ 2

Author(s):

Stephen J. Mondo ◽

Diego Javier Jiménez ◽

Ronald E. Hector ◽

Anna Lipzen ◽

Mi Yan ◽

...

Keyword(s):

Wheat Straw ◽

Purifying Selection ◽

Reticulate Evolution ◽

Phylogenomic Analysis ◽

Lignocellulolytic Enzymes ◽

Genome Expansion ◽

Lack Of Information ◽

Genome Wide ◽

A Genome ◽

Furanic Compounds

Abstract Background Particular species of the genus Coniochaeta (Sordariomycetes) exhibit great potential for bioabatement of furanic compounds and have been identified as an underexplored source of novel lignocellulolytic enzymes, especially Coniochaeta ligniaria. However, there is a lack of information about their genomic features and metabolic capabilities. Here, we report the first in-depth genome/transcriptome survey of a Coniochaeta species (strain 2T2.1). Results The genome of Coniochaeta sp. strain 2T2.1 has a size of 74.53 Mbp and contains 24,735 protein-encoding genes. Interestingly, we detected a genome expansion event, resulting in ~98% of the assembly being duplicated with 91.9% average nucleotide identity between the duplicated regions. The lack of gene loss, as well as the high divergence and strong genome-wide signatures of purifying selection between copies indicates that this is likely a recent duplication, which arose through hybridization between two related Coniochaeta-like species (allopolyploidization). Phylogenomic analysis revealed that 2T2.1 is related to Coniochaeta sp. PMI546 and Lecythophora sp. AK0013, which both occur endophytically. Based on carbohydrate-active enzyme (CAZy) annotation, we observed that even after in silico removal of its duplicated content, the 2T2.1 genome contains exceptional lignocellulolytic machinery. Moreover, transcriptomic data reveal the overexpression of proteins affiliated to CAZy families GH11, GH10 (endoxylanases), CE5, CE1 (xylan esterases), GH62, GH51 (α-L-arabinofuranosidases), GH12, GH7 (cellulases), and AA9 (lytic polysaccharide monooxygenases) when the fungus was grown on wheat straw compared with glucose as the sole carbon source. Conclusions We provide data that suggest that a recent hybridization between the genomes of related species may have given rise to Coniochaeta sp. 2T2.1. Moreover, our results reveal that the degradation of arabinoxylan, xyloglucan and cellulose are key metabolic processes in strain 2T2.1 growing on wheat straw. Different genes for key lignocellulolytic enzymes were identified, which can be starting points for production, characterization and/or supplementation of enzyme cocktails used in saccharification of agricultural residues. Our findings represent first steps that enable a better understanding of the reticulate evolution and “eco-enzymology” of lignocellulolytic Coniochaeta species.

GPRED-GC: a Gene PREDiction model accounting for 5′-3′ GC gradient

BMC Bioinformatics ◽

10.1186/s12859-019-3047-3 ◽

2019 ◽

Vol 20 (S15) ◽

Cited By ~ 1

Author(s):

Prapaporn Techa-Angkoon ◽

Kevin L. Childs ◽

Yanni Sun

Keyword(s):

Ab Initio ◽

Gene Annotation ◽

Gene Prediction ◽

Source Code ◽

Gc Content ◽

Prediction Tools ◽

Homologous Sequences ◽

Manual Intervention ◽

Grass Genomes ◽

Gc Contents

Abstract Background Gene prediction is a key step in genome annotation. Ab initio gene prediction enables gene annotation of new genomes regardless of availability of homologous sequences. There exist a number of ab initio gene prediction tools and they have been widely used for gene annotation for various species. However, existing tools are not optimized for identifying genes with highly variable GC content. In addition, some genes in grass genomes exhibit a sharp 5′-3′ decreasing GC content gradient, which is not carefully modeled by available gene prediction tools. Thus, there is still room to improve the sensitivity and accuracy for predicting genes with GC gradients. Results In this work, we designed and implemented a new hidden Markov model (HMM)-based ab initio gene prediction tool, which is optimized for finding genes with highly variable GC contents, such as the genes with negative GC gradients in grass genomes. We tested the tool on three datasets from Arabidopsis thaliana and Oryza sativa. The results showed that our tool can identify genes missed by existing tools due to the highly variable GC contents. Conclusions GPRED-GC can effectively predict genes with highly variable GC contents without manual intervention. It provides a useful complementary tool to existing ones such as Augustus for more sensitive gene discovery. The source code is freely available at https://sourceforge.net/projects/gpred-gc/.
