An integrated multi-level comparison highlights common aspects and specific features between distantly-related species: Tomato and Grapevine

Motivation. Years after the first genomes were completely sequenced, comparative genomics remains a challenge, one compounded by the availability of numerous draft genomes whose annotation quality is still poor. The detection of orthologous genes between species is a key approach in comparative genomics. For example, ortholog detection may support investigations into the mechanisms that shaped genome organization, highlighting gain or loss of function and informing gene annotation. The detection of paralogous genes, in turn, is fundamental for understanding the evolutionary mechanisms that drove innovation in gene function, and it supports gene family analyses. Here we report on the gene comparison between two distantly related plants, Solanum lycopersicum (Tomato) (The Tomato Genome Consortium 2012) and Vitis vinifera (Grapevine) (Jaillon et al. 2007), economically important species from the asterid and rosid clades, respectively. The strategy integrated multilevel analyses, from domain investigations to expression profiling, to obtain the most reliable results and to offer resources for understanding plant evolution and physiology and for dissecting traits and molecular aspects that could provide novel tools for agricultural applications and biotechnology.

Methods. To predict the best putative orthologs and paralogs between Tomato and Grapevine, and to overcome possible annotation issues, all-against-all sequence similarity searches were performed between the gene, mRNA and protein collections of both species. A Bidirectional Best Hit approach was implemented to detect the best orthologs between the two species. In addition, we developed a dedicated algorithm in Python to define more extended alignments between mRNA sequences. The NetworkX package (Hagberg et al. 2008) was used to define networks of paralogs and orthologs. Protein domain prediction was carried out on the entire Tomato and Grapevine protein collections using InterProScan (Jones et al. 2014). Enzyme classification was obtained by sequence similarity searches between the Tomato and Grapevine mRNA collections and the entire UniProt reviewed protein collection (UniProt Consortium 2015). The metabolic pathways associated with the detected enzymes were identified using the KEGG database (Kanehisa and Goto 2000). Expression levels at three developmental stages of Tomato (2 cm fruit, breaker and mature red) and the corresponding stages of Grapevine (post-setting, veraison, mature berry) were defined on the basis of the iTAG loci (Shearer et al. 2014) and v1 Vitis loci, respectively. Expression was normalized as Reads Per Kilobase per Million (RPKM) for each tissue/stage.

Abstract truncated at 3,000 characters - the full version is available in the pdf file
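Two of the steps named in the Methods, the Bidirectional Best Hit (BBH) selection and the RPKM normalization, can be illustrated with a minimal sketch. This is not the authors' pipeline: the hit tuples, the gene identifiers (`Sl1`, `Vv1`, ...), the use of bit score as the ranking criterion, and the function names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Each hit is (query_id, subject_id, bit_score). Hypothetical input, e.g.
# parsed from the tabular output of a sequence similarity search.
Hit = Tuple[str, str, float]


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Return the highest-scoring subject for every query (first one wins ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of gene length per Million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Toy data with made-up identifiers and scores.
tomato_vs_grape = [("Sl1", "Vv1", 250.0), ("Sl1", "Vv2", 90.0), ("Sl2", "Vv2", 120.0)]
grape_vs_tomato = [("Vv1", "Sl1", 245.0), ("Vv2", "Sl1", 95.0), ("Vv2", "Sl2", 80.0)]

pairs = bidirectional_best_hits(tomato_vs_grape, grape_vs_tomato)
print(pairs)                          # [('Sl1', 'Vv1')]
print(rpkm(500, 2000, 10_000_000))    # 25.0
```

In the toy data, `Sl2` picks `Vv2` as its best hit, but `Vv2` prefers `Sl1`, so only the reciprocal pair `Sl1`/`Vv1` is reported. Pairs like these could then be added as edges to a graph library such as NetworkX to build the ortholog and paralog networks the abstract mentions.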



10.7287/peerj.preprints.2208 ◽

2016 ◽

Author(s):

Luca Ambrosino ◽

Hamed Bostan ◽

Valentino Ruggieri ◽

Maria Luisa Chiusano

Keyword(s):

Comparative Genomics ◽

Developmental Stages ◽

Gene Annotation ◽

Sequence Similarity ◽

Gene Families ◽

Plant Evolution ◽

Loss Of Function ◽

Evolutionary Mechanisms ◽

Important Species ◽

Similarity Searches


Hayai-Annotation Plants: an ultra-fast and comprehensive functional gene annotation system in plants

Bioinformatics ◽

10.1093/bioinformatics/btz380 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4427-4429 ◽

Cited By ~ 7

Author(s):

Andrea Ghelfi ◽

Kenta Shirasawa ◽

Hideki Hirakawa ◽

Sachiko Isobe

Keyword(s):

Enzyme Commission ◽

Gene Annotation ◽

Sequence Similarity ◽

Enzyme Commission Number ◽

Functional Gene ◽

Supplementary Information ◽

Annotation System ◽

Evidence Type ◽

Similarity Searches ◽

Functional Gene Annotation

Summary: Hayai-Annotation Plants is a browser-based interface for an ultra-fast and accurate functional gene annotation system for plant species using R. The pipeline combines sequence-similarity searches, using USEARCH against UniProtKB (taxonomy Embryophyta), with a functional annotation step. Hayai-Annotation Plants provides five layers of annotation: i) protein name; ii) gene ontology terms consisting of its three main domains (Biological Process, Molecular Function and Cellular Component); iii) enzyme commission number; iv) protein existence level; and v) evidence type. It implements a new algorithm that gives priority to protein existence level to propagate GO and EC information, and it annotated Arabidopsis thaliana representative peptide sequences (Araport11) within 5 min at the PC level. Availability and implementation: The software is implemented in R and runs on Macintosh and Linux systems. It is freely available at https://github.com/kdri-genomics/Hayai-Annotation-Plants under the GPLv3 license. Supplementary information: Supplementary data are available at Bioinformatics online.
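The idea of giving priority to protein existence level when propagating annotation can be sketched as follows. The abstract does not describe the algorithm in detail and the tool itself is written in R, so everything here is an assumption for illustration: the `Hit` record, the accessions, the identity threshold, and the tie-break on identity are hypothetical. The only fact relied on is that UniProt protein existence levels run from 1 (evidence at protein level) to 5 (uncertain), so a lower number means stronger evidence.

```python
from typing import List, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    accession: str               # hypothetical UniProtKB accession
    identity: float              # percent identity of the alignment
    existence_level: int         # UniProt protein existence: 1 (strongest) to 5
    go_terms: Tuple[str, ...]    # GO terms attached to the hit
    ec_number: Optional[str]     # EC number, if any


def pick_annotation_source(hits: List[Hit], min_identity: float = 40.0) -> Optional[Hit]:
    """Choose the hit whose GO/EC annotation will be propagated.

    Assumed rule: discard weak alignments, then prefer the strongest
    protein existence level and break ties by higher identity.
    """
    candidates = [h for h in hits if h.identity >= min_identity]
    if not candidates:
        return None
    return min(candidates, key=lambda h: (h.existence_level, -h.identity))


hits = [
    Hit("A_HYPO", 92.0, 3, ("GO:0005515",), None),
    Hit("B_HYPO", 85.0, 1, ("GO:0016301",), "2.7.11.1"),
    Hit("C_HYPO", 30.0, 1, ("GO:0003677",), None),
]
chosen = pick_annotation_source(hits)
print(chosen.accession)   # B_HYPO
```

Here `B_HYPO` is chosen over the more similar `A_HYPO` because its annotation rests on protein-level evidence, while `C_HYPO` is excluded by the (assumed) identity cutoff.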


Identification and expression analysis of the LRR-RLK gene family in tomato (Solanum lycopersicum) Heinz 1706

Genome ◽

10.1139/gen-2015-0035 ◽

2015 ◽

Vol 58 (4) ◽

pp. 121-134 ◽

Cited By ~ 22

Author(s):

Zhirong Wei ◽

Jiehua Wang ◽

Shaohui Yang ◽

Yingjin Song

Keyword(s):

Gene Family ◽

Solanum Lycopersicum ◽

Stress Responses ◽

Tandem Duplication ◽

Developmental Stages ◽

Sequence Similarity ◽

Functional Divergence ◽

Fleshy Fruit ◽

Important Species ◽

Preferential Expression

As the largest subfamily of receptor-like kinases (RLKs), leucine-rich repeat receptor-like kinases (LRR-RLKs) regulate the growth, development, and stress responses of plants. Through a reiterative process of sequence analysis and re-annotation, 234 LRR-RLK genes were identified in the genome of tomato (Solanum lycopersicum) ‘Heinz 1706’ and were further classified into 10 major groups based on their sequence similarity. In contrast to the significant role of tandem duplication in the expansion of this gene family in other species, only approximately 12% (29 out of 234) of SlLRR-RLK genes arose from tandem duplication. Using the multiple expectation maximization for motif elicitation (MEME) method, the motif composition and arrangement were found to be variably conserved within each SlLRR-RLK group, indicating different extents of functional divergence. Expression profiling by qRT-PCR revealed that SlLRR-RLK genes were differentially expressed in various tomato organs and tissues, and some SlLRR-RLK genes exhibited preferential expression in fruits at distinct developmental stages, suggesting that SlLRR-RLKs may play important roles in fruit development and ripening. The results of this study provide an overview of the LRR-RLK gene family in tomato Heinz 1706, an important species of Solanaceae, and will be helpful for future functional analysis of this important protein family in fleshy fruit-bearing species.


Rapid protein evolution, organellar reductions, and invasive intronic elements in the marine aerobic parasite dinoflagellate Amoebophrya spp

BMC Biology ◽

10.1186/s12915-020-00927-9 ◽

2021 ◽

Vol 19 (1) ◽

Cited By ~ 2

Author(s):

Sarah Farhat ◽

Phuong Le ◽

Ehsan Kayal ◽

Benjamin Noel ◽

Estelle Bigeard ◽

...

Keyword(s):

Transposable Elements ◽

Protein Evolution ◽

Sequence Similarity ◽

Expression Patterns ◽

Gene Families ◽

Gene Organization ◽

Host Specialization ◽

Evolutionary Mechanisms ◽

Symbiotic Relationships ◽

Intergenic Regions

Background: Dinoflagellates are aquatic protists particularly widespread in the oceans worldwide. Some are responsible for toxic blooms while others live in symbiotic relationships, either as mutualistic symbionts in corals or as parasites infecting other protists and animals. Dinoflagellates harbor atypically large genomes (~ 3 to 250 Gb), with gene organization and gene expression patterns very different from closely related apicomplexan parasites. Here we sequenced and analyzed the genomes of two early-diverging and co-occurring parasitic dinoflagellate Amoebophrya strains, to shed light on the emergence of such atypical genomic features, dinoflagellate evolution, and host specialization. Results: We sequenced, assembled, and annotated high-quality genomes for two Amoebophrya strains (A25 and A120), using a combination of Illumina paired-end short-read and Oxford Nanopore Technology (ONT) MinION long-read sequencing approaches. We found that a small number of transposable elements, along with short introns and intergenic regions and a limited number of gene families, together contribute to the compactness of the Amoebophrya genomes, a feature potentially linked with parasitism. While the majority of Amoebophrya proteins (63.7% of A25 and 59.3% of A120) had no functional assignment, we found many orthologs shared with Dinophyceae. Our analyses revealed a strong tendency for genes to be encoded in unidirectional clusters and high levels of synteny conservation between the two genomes despite low interspecific protein sequence similarity, suggesting rapid protein evolution. Most strikingly, we identified a large proportion of non-canonical introns, including repeated introns, displaying a broad variability of associated splicing motifs never observed among eukaryotes. Those introner elements appear to have the capacity to spread over their respective genomes in a manner similar to transposable elements. Finally, we confirmed the reduction of organelles observed in Amoebophrya spp., i.e., loss of the plastid and potential loss of a mitochondrial genome and functions. Conclusion: These results expand the range of atypical genome features found in basal dinoflagellates and raise questions regarding speciation and the evolutionary mechanisms at play while parasitism was selected for in this particular unicellular lineage.


Hayai-Annotation Plants: an ultra-fast and comprehensive gene annotation system in plants

10.1101/473488 ◽

2018 ◽

Cited By ~ 2

Author(s):

Andrea Ghelfi ◽

Kenta Shirasawa ◽

Hideki Hirakawa ◽

Sachiko Isobe

Keyword(s):

Enzyme Commission ◽

Biological Process ◽

Gene Annotation ◽

Sequence Similarity ◽

Enzyme Commission Number ◽

Molecular Function ◽

Annotation System ◽

Evidence Type ◽

Similarity Searches ◽

Speed And Accuracy

Summary: Hayai-Annotation Plants is a browser-based interface for an ultra-fast and accurate gene annotation system for plant species using R. The pipeline combines sequence-similarity searches, using USEARCH against UniProtKB (taxonomy Embryophyta), with a functional annotation step. Hayai-Annotation Plants provides five layers of annotation: 1) gene name; 2) gene ontology terms consisting of its three main domains (Biological Process, Molecular Function, and Cellular Component); 3) enzyme commission number; 4) protein existence level; and 5) evidence type. In regard to speed and accuracy, Hayai-Annotation Plants annotated Arabidopsis thaliana (Araport11, representative peptide sequences) within five minutes with an accuracy of 96.4%. Availability and implementation: The software is implemented in R and runs on Macintosh and Linux systems. It is freely available at https://github.com/kdri-genomics/Hayai-Annotation-Plants under the GPLv3 license.


Comparative Genomics of Two New HF1-like Haloviruses

Genes ◽

10.3390/genes11040405 ◽

2020 ◽

Vol 11 (4) ◽

pp. 405 ◽

Cited By ~ 1

Author(s):

Mike Dyall-Smith ◽

Sen-Lin Tang ◽

Brendan Russ ◽

Pei-Wen Chiang ◽

Friedhelm Pfeiffer

Keyword(s):

Comparative Genomics ◽

Dna Methyltransferase ◽

Cell Attachment ◽

Gene Annotation ◽

Sequence Similarity ◽

Direct Repeat ◽

Small Proteins ◽

A Genome ◽

Methyltransferase Gene ◽

High Nucleotide Sequence Similarity

Few genomes of the HF1-group of viruses are currently available, and further examples would enhance the understanding of their evolution, improve their gene annotation, and assist in understanding gene function and regulation. Two novel HF1-group haloviruses, Serpecor1 and Hardycor2, were recovered from widely separated hypersaline lakes in Australia. Both are myoviruses with linear dsDNA genomes and infect the haloarchaeon Halorubrum coriense. Both genomes possess long, terminal direct repeat (TDR) sequences (320 bp for Serpecor1 and 306 bp for Hardycor2). The Serpecor1 genome is 74,196 bp in length, 57.0% G+C, and has 126 annotated coding sequences (CDS). Hardycor2 has a genome of 77,342 bp, 55.6% G+C, and 125 annotated CDS. They show high nucleotide sequence similarity to each other (78%) and with HF1 (>75%), and carry similar intergenic repeat (IR) sequences to those originally described in HF1 and HF2. Hardycor2 carries a DNA methyltransferase gene in the same genomic neighborhood as the methyltransferase genes of HF1, HF2 and HRTV-5, but is in the opposite orientation, and the inferred proteins are only distantly related. Comparative genomics allowed us to identify the candidate genes mediating cell attachment. The genomes of Serpecor1 and Hardycor2 encode numerous small proteins carrying one or more CxxC motifs, a signature feature of zinc-finger domain proteins that are known to participate in diverse biomolecular interactions.
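The CxxC motif mentioned above is simple enough to scan for directly: a cysteine, any two residues, then another cysteine. The sketch below is one way to locate such motifs in a protein sequence with a regular expression; it is an illustration only, not the method used in the study, and the example sequences are made up. A lookahead is used so that overlapping motifs are also reported.

```python
import re
from typing import List, Tuple

# C, any two residues, C. The lookahead makes matches zero-width so that
# overlapping motifs (e.g. in "CAACAAC") are all found.
CXXC = re.compile(r"(?=(C..C))")


def find_cxxc(protein: str) -> List[Tuple[int, str]]:
    """Return (0-based start position, motif) for every CxxC motif."""
    return [(m.start(), m.group(1)) for m in CXXC.finditer(protein.upper())]


print(find_cxxc("MKCPVCGKEFCAACH"))   # [(2, 'CPVC'), (10, 'CAAC')]
print(find_cxxc("CAACAAC"))           # [(0, 'CAAC'), (3, 'CAAC')]
```

A motif hit alone does not establish a zinc-finger domain; it only flags candidates for closer inspection.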


BITACORA: A comprehensive tool for the identification and annotation of gene families in genome assemblies

10.1101/593889 ◽

2019 ◽

Cited By ~ 1

Author(s):

Joel Vizueta ◽

Alejandro Sánchez-Gracia ◽

Julio Rozas

Keyword(s):

Dna Sequences ◽

Gene Annotation ◽

Sequence Similarity ◽

Gene Families ◽

Genomic Research ◽

Model Organisms ◽

Large Gene ◽

Genomic Annotation ◽

Gene Models ◽

Genome Assemblies

Gene annotation is a critical bottleneck in genomic research, especially for the comprehensive study of very large gene families in the genomes of non-model organisms. Despite recent progress in automatic methods, the tools developed for this task often produce inaccurate annotations, such as fused, chimeric, partial or even completely absent gene models for many family copies, which require considerable extra effort to amend. Here we present BITACORA, a bioinformatics solution that integrates sequence similarity search tools and Perl scripts to facilitate both the curation of these inaccurate annotations and the identification of previously undetected gene family copies directly from DNA sequences. We tested the performance of the BITACORA pipeline in annotating the members of two chemosensory gene families of different sizes in seven available chelicerate genome drafts. Despite the relatively high fragmentation of some of these drafts, BITACORA was able to improve the annotation of many members of these families and detected thousands of new chemoreceptors encoded in genome sequences. The program generates an output file in general feature format (GFF), with both curated and novel gene models, and a FASTA file with the predicted proteins. These outputs can be easily integrated into genomic annotation editors, greatly facilitating subsequent manual annotation and downstream evolutionary analyses.


A Similarity Searching System for Biological Phenotype Images Using Deep Convolutional Encoder-decoder Architecture

Current Bioinformatics ◽

10.2174/1574893614666190204150109 ◽

2019 ◽

Vol 14 (7) ◽

pp. 628-639 ◽

Cited By ~ 10

Author(s):

Bizhi Wu ◽

Hangxiao Zhang ◽

Limei Lin ◽

Huiyuan Wang ◽

Yubang Gao ◽

...

Keyword(s):

Neural Network ◽

Retrieval System ◽

Sequence Similarity ◽

Local Alignment ◽

Similarity Searching ◽

Loss Of Function ◽

Biological Images ◽

The Neural Network ◽

Convolutional Autoencoder ◽

Biological Phenotype

Background: The BLAST (Basic Local Alignment Search Tool) algorithm has been widely used for sequence similarity searching. Analogously, public phenotype images should be efficiently retrievable using biological images as queries, so that phenotypes with high similarity can be identified. Despite the accumulation of genotype-phenotype mapping data, no system for searching similar phenotypes is available, owing to the bottleneck of image processing. Objective: In this study, we focus on identifying phenotypic images similar to a query by searching a biological phenotype database that includes information about loss-of-function and gain-of-function. Methods: We propose a deep convolutional autoencoder architecture to segment the biological phenotypic images and develop a phenotype retrieval system to enable a better understanding of genotype–phenotype correlation. Results: This study shows how a deep convolutional autoencoder architecture can be trained on images of biological phenotypes to achieve state-of-the-art performance in a phenotypic image retrieval system. Conclusion: Taken together, the phenotype analysis system can provide further information on the correlation between genotype and phenotype. Additionally, the neural network model for image segmentation and the phenotype retrieval system are equally suitable for any species that has enough phenotype images to train the neural network.


Identification of Novel Toxin Genes from the Stinging Nettle Caterpillar Parasa lepida (Cramer, 1799): Insights into the Evolution of Lepidoptera Toxins

Insects ◽

10.3390/insects12050396 ◽

2021 ◽

Vol 12 (5) ◽

pp. 396

Author(s):

Natrada Mitpuangchon ◽

Kwan Nualcharoen ◽

Singtoe Boonrotpong ◽

Patamarerk Engsontia

Keyword(s):

Protease Inhibitors ◽

Proteolytic Enzymes ◽

Gene Annotation ◽

Sequence Similarity ◽

New Drugs ◽

Toxin Gene ◽

Cone Snail ◽

Stinging Nettle ◽

Toxin Genes ◽

Nettle Caterpillar

Many animal species can produce venom for defense, predation, and competition. The venom usually contains diverse peptide and protein toxins, including neurotoxins, proteolytic enzymes, protease inhibitors, and allergens. Some drugs for cancer and neurological disorders, as well as some analgesics, were developed based on animal toxin structures and functions. Several caterpillar species possess venoms that cause varying effects on humans both locally and systemically. However, toxins from only a few species have been investigated, limiting the full understanding of Lepidoptera toxin diversity and evolution. We used the RNA-seq technique to identify toxin genes from the stinging nettle caterpillar, Parasa lepida (Cramer, 1799). We constructed a transcriptome from caterpillar urticating hairs and reported 34,968 unique transcripts. Using our toxin gene annotation pipeline, we identified 168 candidate toxin genes, including protease inhibitors, proteolytic enzymes, and allergens. The 21 novel P. lepida Knottin-like peptides, which do not show sequence similarity to any known peptide, have predicted 3D structures similar to tarantula, scorpion, and cone snail neurotoxins. We highlighted the importance of convergent evolution in Lepidoptera toxin evolution and its possible mechanisms. This study opens a new path to understanding the hidden diversity of Lepidoptera toxins, which could be a fruitful source for developing new drugs.
