Absent from DNA and protein: genomic characterization of nullomers and nullpeptides across functional categories and evolution

Abstract: Nullomers and nullpeptides are short DNA or amino acid sequences that are absent from a genome or proteome, respectively. One potential cause for their absence could be that they have a detrimental impact on an organism. Here, we identify all possible nullomers and nullpeptides in the genomes and proteomes of over thirty species and show that a significant proportion of these sequences are under negative selection. We assign nullomers to different functional categories (coding sequences, exons, introns, 5’UTR, 3’UTR and promoters) and show that nullomers from coding sequences and promoters are most likely to be selected against. Utilizing variants in the human population, we annotate variant-associated nullomers, highlighting their potential use as DNA ‘fingerprints’. Phylogenetic analyses of nullomers and nullpeptides across evolution show that they could be used to build phylogenetic trees. Our work provides a catalog of genomic and proteome-derived absent k-mers, together with a novel scoring function to determine their potential functional importance. In addition, it shows how these unique sequences could be used as DNA ‘fingerprints’ or for phylogenetic analyses.

Absent from DNA and protein: genomic characterization of nullomers and nullpeptides across functional categories and evolution

Genome Biology ◽

10.1186/s13059-021-02459-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Ilias Georgakopoulos-Soares ◽

Ofer Yizhar-Barnea ◽

Ioannis Mouratidis ◽

Martin Hemberg ◽

Nadav Ahituv

Keyword(s):

Significant Proportion ◽

Amino Acid Sequences ◽

Human Populations ◽

Functional Categories ◽

Naturally Occurring ◽

A Genome ◽

Genomic Characterization ◽

Single Base Pair ◽

Detrimental Impact

Abstract: Nullomers and nullpeptides are short DNA or amino acid sequences that are absent from a genome or proteome, respectively. One potential cause for their absence could be that they have a detrimental impact on an organism. Results: Here, we identify all possible nullomers and nullpeptides in the genomes and proteomes of thirty eukaryotes and demonstrate that a significant proportion of these sequences are under negative selection. We also identify nullomers that are unique to specific functional categories (coding sequences, exons, introns, 5′UTR, 3′UTR, and promoters) and show that coding sequence and promoter nullomers are most likely to be selected against. By analyzing all protein sequences across the tree of life, we further identify 36,081 peptides up to six amino acids in length that do not exist in any known organism, termed primes. We next characterize all possible single base pair mutations that can lead to the appearance of a nullomer in the human genome, observing a significantly higher number of mutations than expected by chance for specific nullomer sequences in transposable elements, likely due to their suppression. We also annotate nullomers that appear due to naturally occurring variants and show that a subset of them can be used to distinguish between different human populations. Analysis of nullomers and nullpeptides across vertebrate evolution shows they can also be used as phylogenetic classifiers. Conclusions: We provide a catalog of nullomers and nullpeptides in distinct functional categories, develop methods to systematically study them, and highlight the use of variability in these sequences in other analyses.
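To make the definition concrete, the following is a minimal Python sketch of how absent k-mers and nullomer-creating single-base substitutions could be enumerated on a toy sequence. It is not the authors' pipeline: the function names (`find_nullomers`, `nullomer_creating_snvs`), the brute-force enumeration, and the choice to count both strands are assumptions made for illustration only.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_nullomers(genome, k, both_strands=True):
    """Return the sorted list of length-k DNA words absent from `genome`.

    Hypothetical helper: enumerates all 4**k words, so it is only
    practical for small k.
    """
    observed = set()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            observed.add(kmer)
            if both_strands:
                observed.add(reverse_complement(kmer))
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted(kmer for kmer in all_kmers if kmer not in observed)


def nullomer_creating_snvs(genome, k, nullomers):
    """List (position, ref, alt, nullomer) for every single-base
    substitution that makes a nullomer appear on the given strand."""
    null_set = set(nullomers)
    hits = []
    for pos, ref in enumerate(genome):
        for alt in "ACGT":
            if alt == ref:
                continue
            # Only the k-mers overlapping the mutated position can change.
            start = max(0, pos - k + 1)
            end = min(len(genome), pos + k)
            window = genome[start:pos] + alt + genome[pos + 1:end]
            for j in range(len(window) - k + 1):
                kmer = window[j:j + k]
                if kmer in null_set:
                    hits.append((pos, ref, alt, kmer))
    return hits


toy_genome = "ACGTACGT"
toy_nullomers = find_nullomers(toy_genome, k=2)
toy_hits = nullomer_creating_snvs(toy_genome, 2, toy_nullomers)
```

When both strands are counted, the set of nullomers is closed under reverse complementation, so scanning only the given strand in `nullomer_creating_snvs` does not miss any mutation. Real genomes would call for a more memory-efficient k-mer counter than a Python set.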

Identification and Molecular Characterization of Nuclear Citrus leprosis virus, a Member of the Proposed Dichorhavirus Genus Infecting Multiple Citrus Species in Mexico

Phytopathology ◽

10.1094/phyto-09-14-0245-r ◽

2015 ◽

Vol 105 (4) ◽

pp. 564-575 ◽

Cited By ~ 19

Author(s):

Avijit Roy ◽

Andrew L. Stone ◽

Jonathan Shao ◽

Gabriel Otero-Colina ◽

Gang Wei ◽

...

Keyword(s):

Genome Sequence ◽

Phylogenetic Trees ◽

Phylogenetic Analyses ◽

Infected Plant ◽

Amino Acid Sequences ◽

N Gene ◽

Infected Leaf ◽

N Protein ◽

A Genome ◽

Citrus Leprosis

Citrus leprosis is one of the most destructive diseases of Citrus spp. and is associated with two unrelated virus groups that produce particles primarily in either the cytoplasm or nucleus of infected plant cells. Symptoms of leprosis, including chlorotic spots surrounded by yellow haloes on leaves and necrotic spots on twigs and fruit, were observed on leprosis-affected mandarin and navel sweet orange trees in the state of Querétaro, Mexico. Serological and molecular assays showed that the cytoplasmic types of Citrus leprosis virus (CiLV-C) often associated with leprosis symptomatic tissues were absent. However, using transmission electron microscopy, bullet-shaped rhabdovirus-like virions were observed in the nuclei and cytoplasm of the citrus leprosis-infected leaf tissues. An analysis of small RNA populations from symptomatic tissue was carried out to determine the genome sequence of the rhabdovirus-like particles observed in the citrus leprosis samples. The complete genome sequence showed that the nuclear type of CiLV (CiLV-N) present in the samples consisted of two negative-sense RNAs: 6,268-nucleotide (nt)-long RNA1 and 5,847-nt-long RNA2, excluding the poly(A) tails. CiLV-N had a genome organization identical to that of Orchid fleck virus (OFV), with the exception of shorter 5′ untranslated regions in RNA1 (53 versus 205 nt) and RNA2 (34 versus 182 nt). Phylogenetic trees constructed with the amino acid sequences of the nucleocapsid (N) and glycoproteins (G) and the RNA polymerase (L protein) showed that CiLV-N clusters with OFV. Furthermore, phylogenetic analyses of N protein established CiLV-N as a member of the proposed genus Dichorhavirus. Reverse-transcription polymerase chain reaction primers for the detection of CiLV-N were designed based on the sequence of the N gene and the assay was optimized and tested to detect the presence of CiLV-N in both diseased and symptom-free plants.

Phylogenetic analysis of canine parvoviruses from Turkey

Medycyna Weterynaryjna ◽

10.21521/mw.6334 ◽

2020 ◽

Vol 76 (01) ◽

pp. 6334-2020

Author(s):

ZEYNEP AKKUTAY-YOLDAR ◽

TAYLAN KOÇ B.

Keyword(s):

High Mortality ◽

Phylogenetic Trees ◽

Clinical Symptoms ◽

Phylogenetic Analyses ◽

Canine Parvovirus ◽

Mortality And Morbidity ◽

Identity Match ◽

Genomic Characterization

Canine parvovirus (CPV) type 2 is the causative agent of acute hemorrhagic enteritis and high mortality in affected dogs. Numerous studies have sought to understand the origin of the virus and to identify new variants and circulating strains. This report describes the detection and genomic characterization of CPV strains from indoor and outdoor dogs in Ankara, Turkey. Samples were sent to our laboratory because of clinical symptoms in puppies. We tested blood and swab samples from three puppies and two adult dogs for the presence of CPV by reverse transcription-polymerase chain reaction (RT-PCR) using primers targeting the VP2 (capsid protein) region of canine parvoviruses. The maximum likelihood (ML) method was then used for phylogenetic analyses to provide molecular characterization data. Phylogenetic trees constructed from the aligned nucleotide sequences revealed that our CPV strains were genetically highly similar, with 100% identity to each other in nucleotide alignments, and were classified as genotype CPV-2b. They were placed in a monophyletic clade as a sister branch to CPV VAC S quantum, with 98.9% nucleotide homology. Our findings suggest that CPV-2b is a current and frequently seen variant in Turkey and that it shows high similarity to other CPV variants, and somewhat less to FPVs, in Turkey and around the world. CPV causes high mortality and morbidity in dogs. Because few studies have been done in Turkey, field strains should be isolated and characterised in order to develop effective vaccines for the protection of dogs there.

Phylogenetic analysis of human rhinovirus capsid protein VP1 and 2A protease coding sequences confirms shared genus-like relationships with human enteroviruses

Journal of General Virology ◽

10.1099/vir.0.80445-0 ◽

2005 ◽

Vol 86 (3) ◽

pp. 697-706 ◽

Cited By ~ 50

Author(s):

Pia Laine ◽

Carita Savolainen ◽

Soile Blomqvist ◽

Tapani Hovi

Keyword(s):

Phylogenetic Analysis ◽

Amino Acid ◽

Capsid Protein ◽

Phylogenetic Trees ◽

Amino Acid Sequences ◽

Human Rhinovirus ◽

Human Enterovirus ◽

Coding Region ◽

Coding Sequences ◽

Capsid Protein Vp1

Phylogenetic analysis of the capsid protein VP1 coding sequences of all 101 human rhinovirus (HRV) prototype strains revealed two major genetic clusters, similar to those of the previously reported VP4/VP2 coding sequences, representing the established two species, Human rhinovirus A (HRV-A) and Human rhinovirus B (HRV-B). Pairwise nucleotide identities varied from 61 to 98 % within and from 46 to 55 % between the two HRV species. Interserotypic sequence identities in both HRV species were more variable than those within any Human enterovirus (HEV) species in the same family. This means that the unequivocal serotype identification by VP1 sequence analysis that is used for HEV strains may not always be possible for HRV isolates. On the other hand, a comprehensive insight into the relationships between VP1 and partial 2A sequences of HRV and HEV revealed a genus-like situation. Pairwise nucleotide identity values between these genera varied from 41 to 54 % in the VP1 coding region, similar to those between heterologous members of the two HRV species. Alignment of the deduced amino acid sequences revealed more fully conserved amino acid residues between HRV-B and polioviruses than between the two HRV species. In phylogenetic trees, where all HRVs and representatives from all HEV species were included, the two HRV species did not cluster together but behaved like members of the same genus as the HEVs. In conclusion, from a phylogenetic point of view, there are no good reasons to keep these two human picornavirus genera taxonomically separated.
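The pairwise nucleotide identities quoted above are percentages of matching positions between two aligned sequences. The Python sketch below shows one common way to compute such a value. It is an illustration under stated assumptions, not the method used in the study: the function names are hypothetical, the sequences are assumed to be already aligned to equal length, and columns containing a gap in either sequence are ignored, which is only one of several possible conventions.

```python
from itertools import combinations


def pairwise_identity(seq_a, seq_b, gap="-"):
    """Percent identity between two aligned sequences of equal length.

    Columns where either sequence has a gap are skipped (an assumed
    convention; other tools count gaps differently).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(seq_a, seq_b):
        if base_a == gap or base_b == gap:
            continue
        compared += 1
        if base_a == base_b:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


def identity_table(aligned):
    """Map each unordered pair of sequence names to its percent identity."""
    return {
        (name_a, name_b): pairwise_identity(aligned[name_a], aligned[name_b])
        for name_a, name_b in combinations(sorted(aligned), 2)
    }


# Toy alignment with made-up sequence names, for illustration only.
toy_alignment = {"s1": "ACGT", "s2": "ACGA", "s3": "AC-T"}
toy_table = identity_table(toy_alignment)
```

Ranges such as "61 to 98 % within a species" would then correspond to the minimum and maximum of these values over all pairs in a group.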

Genomic Characterization and Distribution Pattern of a Novel Marine OM43 Phage

Frontiers in Microbiology ◽

10.3389/fmicb.2021.651326 ◽

2021 ◽

Vol 12 ◽

Author(s):

Mingyu Yang ◽

Qian Xia ◽

Sen Du ◽

Zefeng Zhang ◽

Fang Qin ◽

...

Keyword(s):

Distribution Pattern ◽

Sequence Similarity ◽

Phylogenetic Analyses ◽

Genomic Diversity ◽

Comparative Genomic ◽

Phage Group ◽

C1 Metabolism ◽

Subgroup Ii ◽

A Genome ◽

Genomic Characterization

Bacteriophages have a significant impact on the structure and function of marine microbial communities. Phages of some major bacterial lineages have recently been shown to dominate the marine viral communities. However, phages that infect many important bacterial clades remain unexplored. Members of the marine OM43 clade are methylotrophs that play important roles in C1 metabolism. OM43 phages (phages that infect the OM43 bacteria) represent an understudied viral group with only one known isolate. In this study, we describe the genomic characterization and biogeography of an OM43 phage that infects the strain HTCC2181, designated MEP301. MEP301 has a genome size of 34,774 bp. We found that MEP301 is genetically distinct from other known phage isolates and only displays significant sequence similarity with some metagenomic viral genomes (MVGs). A total of 12 MEP301-type MVGs were identified from metagenomic datasets. Comparative genomic and phylogenetic analyses revealed that MEP301-type phages can be separated into two subgroups (subgroup I and subgroup II). We also performed a metagenomic recruitment analysis to determine the relative abundance of reads mapped to these MEP301-type phages, which suggested that subgroup I MEP301-type phages are present predominantly in cold upper waters with lower salinity. Notably, subgroup II phages have an inverse distribution pattern, implying that they may infect hosts from a distinct OM43 subcluster. Our study has expanded the knowledge about the genomic diversity of marine OM43 phages and identified a new phage group that is widespread in the ocean.

The complete mitochondrial genome of the Eurasian wryneck Jynx torquilla (Aves: Piciformes: Picidae) and its phylogenetic inference

Zootaxa ◽

10.11646/zootaxa.4810.2.8 ◽

2020 ◽

Vol 4810 (2) ◽

pp. 351-360

Author(s):

CHAO DU ◽

LI LIU ◽

YUNPENG LIU ◽

ZHAOHUI FU

Keyword(s):

Mitochondrial Genome ◽

Control Region ◽

Phylogenetic Trees ◽

Phylogenetic Analyses ◽

Complete Mitochondrial Genome ◽

Amino Acid Sequences ◽

Protein Coding ◽

Jynx Torquilla ◽

Monophyletic Lineage ◽

Phylogeny And Evolution

The Eurasian Wryneck is a species of wryneck woodpecker breeding in temperate regions of Europe and Asia. We sequenced the mitochondrial genome of Jynx torquilla (Aves, Piciformes, Picidae) using next-generation sequencing. The circular genome is 16,832 bp long, encoding 13 protein-coding genes (PCGs), 22 transfer RNAs (tRNAs), two ribosomal RNAs (rRNAs), and two control regions. Gene order and orientation are similar to the most common type suggested as ancestral for birds, but the genome has a 1,221 bp control region and a 60 bp remnant control region. Phylogenetic analyses of 17 piciform taxa, based on both nucleotide and amino acid sequences of mitochondrial PCGs, strongly support the monophyly of Picidae. All phylogenetic trees indicate that the subfamily Jynginae is a monophyletic lineage sister to other woodpeckers, including monophyletic Picinae. Only the Bayesian-inferred tree based on the nucleotide dataset recovered Picumninae as monophyletic. These findings will be helpful for understanding the phylogeny and evolution of Picidae.

A short guide to phylogeny reconstruction

Plant Soil and Environment ◽

10.17221/2194-pse ◽

2008 ◽

Vol 53 (No. 10) ◽

pp. 442-446 ◽

Cited By ~ 5

Author(s):

E. Michu

Keyword(s):

Phylogenetic Analysis ◽

Amino Acid ◽

Dna Sequences ◽

Phylogenetic Trees ◽

Phylogenetic Analyses ◽

Amino Acid Sequences ◽

Short Introduction ◽

Closely Related Species ◽

Origin And Evolution ◽

Anatomical Characters

This review is a short introduction to phylogenetic analysis. Phylogenetic analysis allows a comprehensive understanding of the origin and evolution of species. Generally, phylogenetic trees can be constructed from different features and characters (e.g. morphological and anatomical characters, RAPD patterns, FISH patterns, DNA/RNA sequences and amino acid sequences). DNA sequences are preferable for phylogenetic analyses of closely related species, whereas amino acid sequences are used for phylogenetic analyses of more distant relationships. The sequences can be analysed using many computer programs. The methods most often used for phylogenetic analyses are neighbor-joining (NJ), maximum parsimony (MP), maximum likelihood (ML) and Bayesian inference.
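As a small illustration of one of the listed methods, maximum parsimony scores a candidate tree by the minimum number of character changes needed to explain the aligned sequences. The Python sketch below counts those changes with Fitch's algorithm on a fixed rooted binary tree. It is not taken from the review: the nested-tuple tree encoding, the function names, and the toy data are assumptions for illustration, and a full MP analysis would also search over tree topologies.

```python
def fitch_site_changes(tree, leaf_states):
    """Minimum number of state changes at one alignment column.

    `tree` is a rooted binary tree written as nested 2-tuples of leaf
    names (an assumed encoding); `leaf_states` maps each leaf name to
    its character at this column.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {leaf_states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed
            return common
        changes += 1                       # children disagree: one change
        return left | right

    visit(tree)
    return changes


def parsimony_length(tree, alignment):
    """Sum the Fitch change counts over all columns of an alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Toy data: four hypothetical taxa and two competing topologies.
toy_alignment = {"A": "GA", "B": "GA", "C": "TA", "D": "TA"}
tree_ab_cd = (("A", "B"), ("C", "D"))
tree_ac_bd = (("A", "C"), ("B", "D"))
length_ab_cd = parsimony_length(tree_ab_cd, toy_alignment)
length_ac_bd = parsimony_length(tree_ac_bd, toy_alignment)
```

Under the parsimony criterion, the topology with the smaller length is preferred; here that is the tree grouping A with B and C with D.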

VESPA: Very large-scale Evolutionary and Selective Pressure Analyses

PeerJ Computer Science ◽

10.7717/peerj-cs.118 ◽

2017 ◽

Vol 3 ◽

pp. e118 ◽

Cited By ~ 10

Author(s):

Andrew E. Webb ◽

Thomas A. Walsh ◽

Mary J. O’Connell

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Selective Pressure ◽

Gene Families ◽

Pressure Variation ◽

Phylogeny Reconstruction ◽

Protein Coding ◽

Coding Sequences ◽

A Genome ◽

Pressure Analysis

Background: Large-scale molecular evolutionary analyses of protein coding sequences require a number of inter-related preparatory steps, from finding gene families to generating alignments and phylogenetic trees and assessing selective pressure variation. Each phase of these analyses can present significant challenges, particularly when working with entire proteomes (all protein coding sequences in a genome) from a large number of species. Methods: We present VESPA, software capable of automating a selective pressure analysis using codeML in addition to the preparatory analyses and summary statistics. VESPA is written in Python and Perl and is designed to run within a UNIX environment. Results: We have benchmarked VESPA and our results show that the method is consistent, performs well on both large-scale and smaller-scale datasets, and produces results in line with previously published datasets. Discussion: Large-scale gene family identification, sequence alignment, and phylogeny reconstruction are all important aspects of large-scale molecular evolutionary analyses. VESPA provides flexible software for simplifying these processes along with downstream selective pressure variation analyses. The software automatically interprets results from codeML and produces simplified summary files to assist the user in better understanding the results. VESPA may be found at the following website: http://www.mol-evol.org/VESPA.

A proteomic analysis of peanut seed at different stages of underground development to understand the changes of seed proteins

PLoS ONE ◽

10.1371/journal.pone.0243132 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0243132

Author(s):

Haifen Li ◽

Xuanqiang Liang ◽

Baojin Zhou ◽

Xiaoping Chen ◽

Yanbin Hong ◽

...

Keyword(s):

Amino Acid ◽

Profile Analysis ◽

Time Of Flight ◽

Seed Proteins ◽

Hierarchical Cluster ◽

Amino Acid Sequences ◽

Pcr Analysis ◽

Functional Categories ◽

Flight Mass Spectrometry ◽

A Genome

To gain more insight into protein dynamics and the accumulation of allergens in seeds during underground development, we performed a proteomic study on developing peanut seeds at seven different stages. A total of 264 proteins with altered abundance, each containing at least one unique peptide, were detected by matrix-assisted laser desorption ionization time-of-flight/time-of-flight mass spectrometry (MALDI-TOF/TOF MS). All identified proteins were classified into five functional categories at level 1 and 20 secondary functional categories at level 2. Among them, 88 identified proteins (IPs) were related to carbohydrate/amino acid/lipid transport and metabolism, indicating that carbohydrate/amino acid/lipid metabolism played a key role in the underground development of peanut seeds. Hierarchical cluster analysis showed that all IPs could be classified into eight cluster groups according to their abundance profiles, suggesting that the modulatory patterns of these identified proteins were complicated during seed development. The largest group contained 41 IPs, the expression of which decreased at R2, reached a maximum at R3, and gradually decreased from R4. A total of 14 IPs were identified as allergen-like proteins by BLAST against A genome (Arachis duranensis) or B genome (Arachis ipaensis) translated allergen sequences. Abundance profile analysis of the 14 identified allergens showed that the expression of all allergen proteins was low or undetectable by 2-DE at the early stages (R1 to R4), began to accumulate from the R5 stage, and gradually increased. Network analysis showed that most of the significant proteins were involved in active metabolic pathways in early development. Real-time RT-PCR analysis revealed that transcriptional regulation was approximately consistent with expression at the protein level for 8 selected identified proteins. In addition, some amino acid sequences that may be associated with new allergens are also discussed.

Genetic Diversity of Enteric Viruses in Children under Five Years Old in Gabon

Viruses ◽

10.3390/v13040545 ◽

2021 ◽

Vol 13 (4) ◽

pp. 545

Author(s):

Gédéon Prince Manouana ◽

Paul Alvyn Nguema-Moure ◽

Mirabeau Mbong Ngwese ◽

C.-Thomas Bock ◽

Peter G. Kremsner ◽

...

Keyword(s):

Genetic Diversity ◽

Preventive Measures ◽

Phylogenetic Analyses ◽

Enteric Viruses ◽

Study Cohort ◽

Viral Agent ◽

A Genome ◽

Stool Samples ◽

Pcr Techniques ◽

High Diversity

Enteric viruses are the leading cause of diarrhea in children globally. Identifying viral agents and understanding their genetic diversity could help to develop effective preventive measures. This study aimed to determine the detection rate and genetic diversity of four enteric viruses in Gabonese children aged below five years. Stool samples from children <5 years with (n = 177) and without (n = 67) diarrhea were collected from April 2018 to November 2019. Norovirus, astrovirus, sapovirus, and aichivirus A were identified using PCR techniques followed by sequencing and phylogenetic analyses. At least one viral agent was identified in 23.2% and 14.9% of the symptomatic and asymptomatic participants, respectively. Norovirus (14.7%) and astrovirus (7.3%) were the most prevalent in children with diarrhea, whereas in the healthy group norovirus (9%) followed by the first reported aichivirus A in Gabon (6%) were predominant. The predominant norovirus genogroup was GII, consisting mostly of genotype GII.P31-GII.4 Sydney. Phylogenetic analysis of the 3CD region of the aichivirus A genome revealed the presence of two genotypes (A and C) in the study cohort. Astrovirus and sapovirus showed a high diversity, with five different astrovirus genotypes and four sapovirus genotypes, respectively. Our findings give new insights into the circulation and genetic diversity of enteric viruses in Gabonese children.
