BITACORA: A comprehensive tool for the identification and annotation of gene families in genome assemblies

Abstract. Gene annotation is a critical bottleneck in genomic research, especially for the comprehensive study of very large gene families in the genomes of non-model organisms. Despite recent progress in automatic methods, the tools developed for this task often produce inaccurate annotations, such as fused, chimeric, partial or even completely absent gene models for many family copies, which require considerable extra effort to amend. Here we present BITACORA, a bioinformatics solution that integrates sequence similarity search tools and Perl scripts to facilitate both the curation of these inaccurate annotations and the identification of previously undetected gene family copies directly from DNA sequences. We tested the performance of the BITACORA pipeline in annotating the members of two chemosensory gene families of different sizes in seven available chelicerate genome drafts. Despite the relatively high fragmentation of some of these drafts, BITACORA was able to improve the annotation of many members of these families and detected thousands of new chemoreceptors encoded in genome sequences. The program generates output files in general feature format (GFF), containing both curated and novel gene models, and a FASTA file with the predicted proteins. These outputs can be easily integrated into genomic annotation editors, greatly facilitating subsequent manual annotation and downstream evolutionary analyses.

Shared Data Science Infrastructure for Genomics Data

10.1101/307777 ◽ 2018

Author(s): Hamid Bagher ◽ Usha Muppiral ◽ Andrew J Severin ◽ Hridesh Rajan

Keyword(s): Data Science ◽ Gene Annotation ◽ Large Data ◽ Biological Data ◽ Genomic Research ◽ Data Repository ◽ Small Data ◽ Data Repositories ◽ Shared Data ◽ Genome Assemblies

Abstract. Background. Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared Data Science Infrastructures like Boa can be used to more efficiently process and parse data contained in large data repositories. The main features of Boa are inspired by existing languages for data-intensive computing and can easily integrate data from biological data repositories. Results. Here, we present an implementation of Boa for Genomic research (BoaG) on a relatively small data repository: RefSeq’s 97,716 annotation (GFF) and assembly (FASTA) files and metadata. We used BoaG to query the entire RefSeq dataset and gain insight into the RefSeq genome assemblies and gene model annotations and show that assembly quality using the same assembler varies depending on species. Conclusions. In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, BoaG, can give researchers greater access to efficiently explore data in ways previously not possible for anyone but the most well funded research groups. We demonstrate the efficiency of BoaG to explore the RefSeq database of genome assemblies and annotations to identify interesting features of gene annotation as a proof of concept for much larger datasets.

Transcriptome Analysis in Domesticated Species: Challenges and Strategies

Bioinformatics and Biology Insights ◽ 10.4137/bbi.s29334 ◽ 2015 ◽ Vol 9S4 ◽ pp. BBI.S29334 ◽ Cited By ~ 4

Author(s): Jessica P. Hekman ◽ Jennifer L Johnson ◽ Anna V. Kukekova

Keyword(s): Complex Traits ◽ Gene Networks ◽ Association Studies ◽ Cultural Value ◽ Genomic Research ◽ Model Organisms ◽ Genome Wide Association Studies ◽ Rna Seq ◽ Genome Wide ◽ Genome Assemblies

Domesticated species occupy a special place in the human world due to their economic and cultural value. In the era of genomic research, domesticated species provide unique advantages for investigation of diseases and complex phenotypes. RNA sequencing, or RNA-seq, has recently emerged as a new approach for studying transcriptional activity of the whole genome, changing the focus from individual genes to gene networks. RNA-seq analysis in domesticated species may complement genome-wide association studies of complex traits with economic importance or direct relevance to biomedical research. However, RNA-seq studies are more challenging in domesticated species than in model organisms. These challenges are at least in part associated with the lack of quality genome assemblies for some domesticated species and the absence of genome assemblies for others. In this review, we discuss strategies for analyzing RNA-seq data, focusing particularly on questions and examples relevant to domesticated species.

Conversion between 100-million-year-old duplicated genes contributes to rice subspecies divergence

10.1101/2020.12.22.424042 ◽ 2020

Author(s): Chendan Wei ◽ Zhenyi Wang ◽ Jianyu Wang ◽ Jia Teng ◽ Shaoqi Shen ◽ ...

Keyword(s): Gene Conversion ◽ Sequence Similarity ◽ Phylogenetic Analyses ◽ Chromosome Rearrangement ◽ Gene Families ◽ Single Pair ◽ Duplicated Genes ◽ Cultivated Rice ◽ Large Gene ◽ Conversion Rates

Abstract. Extensive sequence similarity between duplicated gene pairs produced by paleo-polyploidization may result from illegitimate recombination between homologous chromosomes. The genomes of Asian cultivated rice Xian/indica (XI) and Geng/japonica (GJ) have recently been updated, providing new opportunities for investigating on-going gene conversion events. Using comparative genomics and phylogenetic analyses, we evaluated gene conversion rates between duplicated genes produced by polyploidization 100 million years ago (mya) in GJ and XI. At least 5.19%–5.77% of genes duplicated across three genomes were affected by whole-gene conversion after the divergence of GJ and XI at ~0.4 mya, with more (7.77%–9.53%) showing conversion of only gene portions. Independently converted duplicates surviving in genomes of different subspecies often used the same donor genes. On-going gene conversion frequency was higher near chromosome termini, with a single pair of homoeologous chromosomes 11 and 12 in each genome most affected. Notably, on-going gene conversion has maintained similarity between very ancient duplicates, provided opportunities for further gene conversion, and accelerated rice divergence. Chromosome rearrangement after polyploidization may result in gene loss, providing a basis for on-going gene conversion, and may have contributed directly to restricted recombination/conversion between homoeologous regions. Gene conversion affected biological functions associated with multiple genes, such as catalytic activity, implying opportunities for interaction among members of large gene families, such as NBS-LRR disease-resistance genes, resulting in gene conversion.
Duplicated genes in rice subspecies generated by grass polyploidization ~100 mya remain affected by gene conversion at high frequency, with important implications for the divergence of rice subspecies. One-sentence summary. On-going gene conversion between duplicated genes produced by 100 mya polyploidization contributes to rice subspecies divergence, often involving the same donor genes at chromosome termini.

From single nuclei to whole genome assemblies

10.1101/625814 ◽ 2019 ◽ Cited By ~ 3

Author(s): Merce Montoliu-Nerin ◽ Marisol Sánchez-García ◽ Claudia Bergin ◽ Manfred Grabherr ◽ Barbara Ellis ◽ ...

Keyword(s): Single Cell ◽ Large Scale ◽ Genomic Data ◽ Life Cycles ◽ Genomic Research ◽ Metagenomic Data ◽ Model Organisms ◽ Genomic Study ◽ And Function ◽ Genome Assemblies

Summary. A large proportion of Earth's biodiversity consists of organisms that cannot be cultured, have cryptic life-cycles and/or live submerged within their substrates [1–4]. Genomic data are key to unravelling both their identity and function [5]. The development of metagenomic methods [6,7] and the advent of single cell sequencing [8–10] have revolutionized the study of life and function of cryptic organisms by upending the need for large and pure biological material, and allowing generation of genomic data from complex or limited environmental samples. Genome assemblies from metagenomic data have so far been restricted to organisms with small genomes, such as bacteria [11], archaea [12] and certain eukaryotes [13]. On the other hand, single cell technologies have allowed the targeting of unicellular organisms, attaining a better resolution than metagenomics [8,9,14–16]; moreover, they have allowed the genomic study of cells from complex organisms one cell at a time [17,18]. However, single cell genomics is not easily applied to multicellular organisms formed by consortia of diverse taxa, and the generation of specific workflows for sequencing and data analysis is needed to expand genomic research to the entire tree of life, including sponges [19], lichens [3,20], intracellular parasites [21,22], and plant endophytes [23,24]. Among the most important plant endophytes are the obligate mutualistic symbionts, arbuscular mycorrhizal (AM) fungi, that pose an additional challenge with their multinucleate coenocytic mycelia [25]. Here, the development of a novel single nuclei sequencing and assembly workflow is reported. This workflow allows, for the first time, the generation of reference genome assemblies from large scale, unbiased sorted, and sequenced AM fungal nuclei, circumventing tedious, and often impossible, culturing efforts.
This method opens infinite possibilities for studies of evolution and adaptation in these important plant symbionts and demonstrates that reference genomes can be generated from complex non-model organisms by isolating only a handful of their nuclei.

Reference Genome for the Highly Transformable Setaria viridis ME034V

G3 Genes|Genome|Genetics ◽ 10.1534/g3.120.401345 ◽ 2020 ◽ Vol 10 (10) ◽ pp. 3467-3478 ◽ Cited By ~ 2

Author(s): Peter M. Thielen ◽ Amanda L. Pendleton ◽ Robert A. Player ◽ Kenneth V. Bowden ◽ Thomas J. Lawton ◽ ...

Keyword(s): De Novo ◽ Gene Families ◽ Model Organisms ◽ Phylogenomic Analysis ◽ Setaria Viridis ◽ Sequencing Technology ◽ Protein Coding ◽ Genotype Frequencies ◽ Green Foxtail ◽ Genome Assemblies

Setaria viridis (green foxtail) is an important model system for improving cereal crops due to its diploid genome, ease of cultivation, and use of C4 photosynthesis. The S. viridis accession ME034V is exceptionally transformable, but the lack of a sequenced genome for this accession has limited its utility. We present a 397 Mb highly contiguous de novo assembly of ME034V using ultra-long nanopore sequencing technology (read N50 = 41kb). We estimate that this genome is largely complete based on our updated k-mer based genome size estimate of 401 Mb for S. viridis. Genome annotation identified 37,908 protein-coding genes and >300k repetitive elements comprising 46% of the genome. We compared the ME034V assembly with two other previously sequenced Setaria genomes as well as to a diversity panel of 235 S. viridis accessions. We found the genome assemblies to be largely syntenic, but numerous unique polymorphic structural variants were discovered. Several ME034V deletions may be associated with recent retrotransposition of copia and gypsy LTR repeat families, as evidenced by their low genotype frequencies in the sampled population. Lastly, we performed a phylogenomic analysis to identify gene families that have expanded in Setaria, including those involved in specialized metabolism and plant defense response. The high continuity of the ME034V genome assembly validates the utility of ultra-long DNA sequencing to improve genetic resources for emerging model organisms. Structural variation present in Setaria illustrates the importance of obtaining the proper genome reference for genetic experiments. Thus, we anticipate that the ME034V genome will be of significant utility for the Setaria research community.
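The abstract above reports a read N50 of 41 kb. As a reminder of what that statistic measures, here is a minimal Python sketch of the usual N50 definition: the length L such that sequences of length L or longer contain at least half of all bases. The function name `n50` and the toy lengths are illustrative assumptions and do not come from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy lengths (made up for illustration, not data from the study).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))  # total is 54; 10 + 9 + 8 = 27 reaches half, so N50 is 8
```

The same function applies to read lengths or to contig lengths; only the input changes.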

Construction of coffee transcriptome networks based on gene annotation semantics

Journal of Integrative Bioinformatics ◽ 10.1515/jib-2012-205 ◽ 2012 ◽ Vol 9 (3) ◽ pp. 80-92 ◽ Cited By ~ 1

Author(s): Luis F. Castillo ◽ Narmer Galeano ◽ Gustavo A. Isaza ◽ Alvaro Gaitan

Keyword(s): Gene Networks ◽ Gene Annotation ◽ Expression Patterns ◽ Biological Significance ◽ Local Alignment ◽ Large Gene ◽ Semantic Concepts ◽ Pathways Analysis ◽ Gene Models ◽ Description Framework

Summary. Gene annotation is a process that encompasses multiple approaches to the analysis of nucleic acid or protein sequences in order to assign structural and functional characteristics to gene models. When thousands of gene models are being described in an organism's genome, construction and visualization of gene networks impose novel challenges in the understanding of complex expression patterns and the generation of new knowledge in genomics research. In order to take advantage of accumulated text data after conventional gene sequence analysis, this work applied semantics in combination with visualization tools to build transcriptome networks from a set of coffee gene annotations. A set of selected coffee transcriptome sequences, chosen by the quality of the sequence comparison reported by Basic Local Alignment Search Tool (BLAST) and Interproscan, were filtered out by coverage, identity, length of the query, and e-values. Meanwhile, term descriptors for molecular biology and biochemistry were obtained along the Wordnet dictionary in order to construct a Resource Description Framework (RDF) using Ruby scripts and Methontology to find associations between concepts. Relationships between sequence annotations and semantic concepts were graphically represented through a total of 6845 oriented vectors, which were reduced to 745 non-redundant associations. A large gene network connecting transcripts by way of relational concepts was created, in which detailed connections remain to be validated for biological significance based on current biochemical and genetics frameworks. Besides reusing text information in the generation of gene connections and for data mining purposes, this tool development opens the possibility to visualize complex and abundant transcriptome data, and triggers the formulation of new hypotheses in metabolic pathways analysis.

PIC-Me: paralogs and isoforms classifier based on machine-learning approaches

BMC Bioinformatics ◽ 10.1186/s12859-021-04229-x ◽ 2021 ◽ Vol 22 (S11)

Author(s): Jooseong Oh ◽ Sung-Gwon Lee ◽ Chungoo Park

Keyword(s): Machine Learning ◽ Large Scale ◽ Gene Annotation ◽ Sequence Similarity ◽ Global Analysis ◽ Model Organism ◽ Model Organisms ◽ Support Vector ◽ Learning Approaches ◽ Rna Seq

Abstract. Background. Paralogs formed through gene duplication and isoforms formed through alternative splicing have been important processes for increasing protein diversity and maintaining cellular homeostasis. Despite their recognized importance and the advent of large-scale genomic and transcriptomic analyses, paradoxically, accurate annotations of all gene loci to allow the identification of paralogs and isoforms remain surprisingly incomplete. In particular, the global analysis of the transcriptome of a non-model organism for which there is no reference genome is especially challenging. Results. To reliably discriminate between the paralogs and isoforms in RNA-seq data, we redefined the pre-existing sequence features (sequence similarity, inverse count of consecutive identical or non-identical blocks, and match-mismatch fraction) previously derived from full-length cDNAs and EST sequences and described newly discovered genomic and transcriptomic features (twilight zone of protein sequence alignment and expression level difference). In addition, the effectiveness and relevance of the proposed features were verified with two widely used models, support vector machine (SVM) and random forest (RF). From nine RNA-seq datasets, all AUC (area under the curve) scores of ROC (receiver operating characteristic) curves were over 0.9 in the RF model and significantly higher than those in the SVM model. Conclusions. In this study, using an RF model with five proposed RNA-seq features, we implemented our method called Paralogs and Isoforms Classifier based on Machine-learning approaches (PIC-Me) and showed that it outperformed an existing method. Finally, we envision that our tool will be a valuable computational resource for the genomics community to help with gene annotation and will aid in comparative transcriptomics and evolutionary genomics studies, especially those on non-model organisms.
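The abstract above compares classifiers by the area under the ROC curve (AUC). The following minimal Python sketch shows one standard way to compute that quantity: the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half. The function name `roc_auc`, the label coding (1 for a paralog pair, 0 for an isoform pair) and the toy scores are assumptions made for illustration; none of them is taken from PIC-Me.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example is
    scored higher than a randomly chosen negative one (ties count 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = paralog pair, 0 = isoform pair) and classifier scores.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(roc_auc(toy_labels, toy_scores))  # 8 of the 9 positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of examples, which is acceptable for a sketch; rank-based formulas give the same value faster on large datasets.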

An integrated multi-level comparison highlights common aspects and specific features between distantly-related species: Tomato and Grapevine

10.7287/peerj.preprints.2208v1 ◽ 2016

Author(s): Luca Ambrosino ◽ Hamed Bostan ◽ Valentino Ruggieri ◽ Maria Luisa Chiusano

Keyword(s): Comparative Genomics ◽ Developmental Stages ◽ Gene Annotation ◽ Sequence Similarity ◽ Gene Families ◽ Plant Evolution ◽ Loss Of Function ◽ Evolutionary Mechanisms ◽ Important Species ◽ Similarity Searches

Motivation. Even years after the first genomes were completely sequenced, comparative genomics remains a challenge, made harder by the availability of numerous draft genomes with still-poor annotation quality. The detection of ortholog genes between different species is a key approach for comparative genomics. For example, ortholog gene detection may support investigations into the mechanisms that shaped the organization of the genomes, shedding light on gain or loss of function and on gene annotation. On the other hand, the detection of paralog genes is fundamental for understanding the evolutionary mechanisms that drove gene function innovation and supports gene family analyses. Here we report on the gene comparison between two distantly related plants, Solanum lycopersicum (Tomato) (The Tomato Genome Consortium 2012) and Vitis vinifera (Grapevine) (Jaillon et al. 2007), considered economically important species from the asterid and rosid clades, respectively. The strategy was accompanied by the integration of multilevel analyses, from domain investigations to expression profiling, to obtain the most reliable results and to offer powerful resources for understanding different aspects of plant evolution and physiology and for dissecting traits and molecular aspects that could provide novel tools for agricultural applications and biotechnologies. Methods. In order to predict the best putative orthologs and paralogs between Tomato and Grapevine, and to overcome possible annotation issues, all-against-all sequence similarity searches between the gene, mRNA and protein collections of both species were performed. A Bidirectional Best Hit approach was implemented to detect the best orthologs between the two species. Moreover, we developed a dedicated algorithm in the Python programming language to define more extended alignments between mRNA sequences. The NetworkX package (Hagberg et al. 2008) was used to define networks of paralogs and orthologs.
Protein domain prediction was carried out on the entire Tomato and Grapevine protein collections using the InterProScan program (Jones et al. 2014). The enzyme classification was obtained by sequence similarity searches between the Tomato and Grapevine mRNA collections and the entire UniProt reviewed protein collection (UniProt consortium 2015). The metabolic pathways associated with the detected enzymes were identified using the KEGG Database (Kanehisa and Goto 2000). Expression levels at three developmental stages of Tomato (2 cm fruit, breaker and mature red) and the corresponding stages of Grapevine (post-setting, veraison, mature berry) were defined on the basis of the iTAG loci (Shearer et al. 2014) and v1 vitis loci, respectively. The expression was normalized as Reads Per Kilobase per Million (RPKM) for each tissue/stage. Abstract truncated at 3,000 characters - the full version is available in the pdf file
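The Methods above name a Bidirectional Best Hit approach without giving implementation details. The following minimal Python sketch shows the general idea only: a pair is kept when each sequence is the other's top-scoring hit in the two directions of the similarity search. The function names, the (query, subject, score) tuple layout, the use of a single numeric score and the sequence identifiers are all assumptions made for illustration; this is not the authors' code.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples, for example
    parsed from a tabular similarity-search report (layout assumed here).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical identifiers and scores, invented for the example.
tomato_vs_grape = [("T1", "G1", 250.0), ("T1", "G2", 90.0),
                   ("T2", "G2", 180.0), ("T3", "G2", 200.0)]
grape_vs_tomato = [("G1", "T1", 245.0), ("G2", "T3", 198.0),
                   ("G2", "T2", 170.0)]
print(bidirectional_best_hits(tomato_vs_grape, grape_vs_tomato))
# T2's best hit is G2, but G2's best hit is T3, so only T1/G1 and T3/G2 are kept.
```

Ties between equally scored hits are resolved here by keeping the first one seen, a simplification that a real pipeline would need to handle explicitly.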

CAARS: comparative assembly and annotation of RNA-Seq data

Bioinformatics ◽ 10.1093/bioinformatics/bty903 ◽ 2018 ◽ Vol 35 (13) ◽ pp. 2199-2207 ◽ Cited By ~ 1

Author(s): Carine Rey ◽ Philippe Veber ◽ Bastien Boussau ◽ Marie Sémon

Keyword(s): Gene Family ◽ De Novo ◽ Sequence Similarity ◽ Gene Families ◽ Supplementary Information ◽ Model Organisms ◽ Difficult Case ◽ Rna Seq ◽ Comparative Analyses ◽ Family Reconstruction

Abstract. Motivation. RNA sequencing (RNA-Seq) is a widely used approach to obtain transcript sequences in non-model organisms, notably for performing comparative analyses. However, current bioinformatic pipelines do not take full advantage of pre-existing reference data in related species for improving RNA-Seq assembly, annotation and gene family reconstruction. Results. We built an automated pipeline named CAARS to combine novel data from RNA-Seq experiments with existing multi-species gene family alignments. RNA-Seq reads are assembled into transcripts by both de novo and assisted assemblies. Then, CAARS incorporates transcripts into gene families, builds gene alignments and trees and uses phylogenetic information to classify the genes as orthologs and paralogs of existing genes. We used CAARS to assemble and annotate RNA-Seq data in rodents and fishes using distantly related genomes as reference, a difficult case for this kind of analysis. We showed CAARS assemblies are more complete and accurate than those assembled by a standard pipeline consisting of de novo assembly coupled with annotation by sequence similarity on a guide species. In addition to annotated transcripts, CAARS provides gene family alignments and trees, annotated with orthology relationships, directly usable for downstream comparative analyses. Availability and implementation. CAARS is implemented in Python and OCaml and is freely available at https://github.com/carinerey/caars. Supplementary information. Supplementary data are available at Bioinformatics online.
