ASPic-GeneID: A Lightweight Pipeline for Gene Prediction and Alternative Isoforms Detection

2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Tyler Alioto ◽  
Ernesto Picardi ◽  
Roderic Guigó ◽  
Graziano Pesole

New genomes are being sequenced at an increasingly rapid rate, far outpacing the rate at which manual gene annotation can be performed. This growth in genome projects makes automated genome annotation a necessity; however, full-fledged annotation systems are usually home-grown and customized to a particular genome. There is thus a renewed need for accurate ab initio gene prediction methods. It is apparent, however, that fully ab initio methods fall short of the sensitivity and specificity required for a quality annotation. Evidence in the form of expressed sequences gives the single biggest improvement in accuracy when used to inform gene predictions. Here, we present a lightweight pipeline for first-pass gene prediction on newly sequenced genomes. The two main components are ASPic, a program that derives highly accurate, albeit not necessarily complete, EST-based transcript annotations from EST alignments, and GeneID, a standard gene prediction program, which we have modified to take intron annotations as evidence. The introns output by ASPic CDS predictions are given to GeneID to constrain the exon-chaining process and produce predictions consistent with the underlying EST alignments. The pipeline was successfully tested on the entire C. elegans genome and the 44 ENCODE human pilot regions.

2019 ◽  
Vol 20 (S15) ◽  
Author(s):  
Prapaporn Techa-Angkoon ◽  
Kevin L. Childs ◽  
Yanni Sun

Background: Gene prediction is a key step in genome annotation. Ab initio gene prediction enables gene annotation of new genomes regardless of the availability of homologous sequences. A number of ab initio gene prediction tools exist, and they have been widely used for gene annotation in various species. However, existing tools are not optimized for identifying genes with highly variable GC content. In addition, some genes in grass genomes exhibit a sharp 5′-3′ decreasing GC content gradient, which is not carefully modeled by available gene prediction tools. Thus, there is still room to improve the sensitivity and accuracy of predicting genes with GC gradients. Results: In this work, we designed and implemented a new hidden Markov model (HMM)-based ab initio gene prediction tool, which is optimized for finding genes with highly variable GC content, such as the genes with negative GC gradients in grass genomes. We tested the tool on three datasets from Arabidopsis thaliana and Oryza sativa. The results showed that our tool can identify genes missed by existing tools because of their highly variable GC content. Conclusions: GPRED-GC can effectively predict genes with highly variable GC content without manual intervention. It provides a useful complementary tool to existing ones such as Augustus for more sensitive gene discovery. The source code is freely available at https://sourceforge.net/projects/gpred-gc/.
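To make the notion of a 5′-3′ decreasing GC gradient concrete, the sketch below measures GC content in consecutive windows along a sequence and fits a least-squares slope to those values. It is only an illustration and is not the GPRED-GC model: the function names, the window size, and the use of a linear slope as the gradient measure are all assumptions made for this example.

```python
def gc_fraction(seq):
    """Fraction of G and C among the A/C/G/T bases of seq (0.0 if none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc_windows(seq, window=30, step=30):
    """GC fraction of each full window, read from the 5' end to the 3' end.

    window and step are illustrative defaults, not values from the paper.
    """
    return [
        gc_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


def gc_slope(values):
    """Least-squares slope of GC fraction against window index.

    A negative slope indicates GC content falling from 5' to 3'.
    """
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


if __name__ == "__main__":
    # Toy sequence: GC-rich at the 5' end, AT-rich at the 3' end.
    toy = "G" * 10 + "GGGGGAAAAA" + "A" * 10
    profile = gc_windows(toy, window=10, step=10)
    print(profile, gc_slope(profile))
```

A gene whose slope is strongly negative under such a measure would be one of the cases the abstract describes as poorly modeled by tools that assume a single GC composition per gene.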


2020 ◽  
Author(s):  
Subodh K Srivastava ◽  
Kurt Zeller ◽  
James H Sobieraj ◽  
Mark K Nakhla

Whole Genome Sequence (WGS) based identifications are being increasingly used by regulatory and public health agencies to facilitate the detection, investigation, and control of pathogens and pests. Fusarium oxysporum f. sp. vasinfectum (FOV) is a significant vascular wilt pathogen of cultivated cotton, and consists of several pathogenic races that are not each other’s closest phylogenetic relatives. We have developed WGS assemblies for isolates of race 1 (FOV1), race 4 (FOV4), race 5 (FOV5), and race 8 (FOV8) using a combination of Nanopore (MinION) and Illumina (MiSeq) sequencing technologies. This resulted in assembled contigs with more than 100X coverage for each of the FOV races and estimated genome sizes of 52 Mb for FOV1, 68 Mb for FOV4, 68 Mb for FOV5, and 55 Mb for FOV8. The AUGUSTUS gene prediction program predicted 16,263 genes in FOV1, 20,259 genes in FOV4, 20,375 genes in FOV5, and 16,615 genes in FOV8. We were able to identify 525 genes unique to FOV1, 570 unique to FOV4, 1,242 unique to FOV5, and 383 unique to FOV8. We expect that these findings will help in comparative genomics, and in the identification of unique genes as candidate targets for diagnostic marker and methods development to permit rapid differentiation of FOV subgroups.


Blood ◽  
2002 ◽  
Vol 99 (12) ◽  
pp. 4638-4641 ◽  
Author(s):  
Jacqueline Boultwood ◽  
Carrie Fidler ◽  
Amanda J. Strickson ◽  
Fiona Watkins ◽  
Susana Gama ◽  
...  

The 5q− syndrome is the most distinct of the myelodysplastic syndromes, and the molecular basis for this disorder remains unknown. We describe the narrowing of the common deleted region (CDR) of the 5q− syndrome to the approximately 1.5-megabase interval at 5q32 flanked by D5S413 and the GLRA1 gene. The Ensembl gene prediction program has been used for the complete genomic annotation of the CDR. The CDR is gene rich and contains 24 known genes and 16 novel (predicted) genes. Of the 40 genes in the CDR, 33 are expressed in CD34+ cells and therefore represent candidate genes, since they are expressed within the hematopoietic stem/progenitor cell compartment. A number of the genes assigned to the CDR represent good candidates for the 5q− syndrome, including MEGF1, G3BP, and several of the novel gene predictions. These data now afford a comprehensive mutational/expression analysis of all candidate genes assigned to the CDR.


2020 ◽  
Vol 8 (1) ◽  
pp. 102 ◽  
Author(s):  
Tangcheng Li ◽  
Liying Yu ◽  
Bo Song ◽  
Yue Song ◽  
Ling Li ◽  
...  

Cataloging an accurate functional gene set for the Symbiodiniaceae species is crucial for addressing biological questions of dinoflagellate symbiosis with corals and other invertebrates. To improve the gene models of Fugacium kawagutii, we conducted high-throughput chromosome conformation capture (Hi-C) for the genome and Illumina combined with PacBio sequencing for the transcriptome to achieve a new genome assembly and gene prediction. A 0.937-Gbp assembly of F. kawagutii was obtained, with an N50 > 13 Mbp and the longest scaffold of 121 Mbp capped with a telomere motif at both ends. Gene annotation produced 45,192 protein-coding genes, of which 11,984 are new compared to previous versions of the genome. The newly identified genes are mainly enriched in 38 KEGG pathways including N-Glycan biosynthesis, mRNA surveillance pathway, cell cycle, autophagy, mitophagy, and fatty acid synthesis, which are important for symbiosis, nutrition, and reproduction. The newly identified genes also included those encoding O-methyltransferase (O-MT), 3-dehydroquinate synthase, homologous-pairing protein 2-like (HOP2) and meiosis protein 2 (MEI2), which function in mycosporine-like amino acids (MAAs) biosynthesis and sexual reproduction, respectively. The improved version of the gene set (Fugka_Geneset_V3) raised the transcriptomic read mapping rate from 33% to 54% and the BUSCO match from 29% to 55%. Further differential gene expression analysis yielded a set of stably expressed genes under variable trace metal conditions, of which 115 with annotated functions have recently been found to be stably expressed under three other conditions, thus further developing the “core gene set” of F. kawagutii. This improved genome will prove useful for future Symbiodiniaceae transcriptomic, gene structure, and gene expression studies, and the refined “core gene set” will be a valuable resource from which to develop reference genes for gene expression studies.
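The abstract reports assembly contiguity as an N50 value. For readers unfamiliar with the statistic, the sketch below computes it under the common definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The function name and the toy lengths are assumptions for illustration, and the authors may have used a different tool or convention.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are taken from longest to shortest; the N50 is the length of
    the scaffold at which the running total first reaches half of the
    summed assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Toy scaffold lengths, not taken from the F. kawagutii assembly.
    print(n50([2, 3, 4, 5, 6]))
```

Because the statistic is weighted by length, a handful of very long scaffolds can raise N50 substantially even when many short scaffolds remain.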


BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Nicolas Scalzitti ◽  
Anne Jeannin-Girardon ◽  
Pierre Collet ◽  
Olivier Poch ◽  
Julie D. Thompson

2005 ◽  
Vol 57 (3) ◽  
pp. 445-460 ◽  
Author(s):  
Hong Yao ◽  
Ling Guo ◽  
Yan Fu ◽  
Lisa A. Borsuk ◽  
Tsui-Jung Wen ◽  
...  

2017 ◽  
Author(s):  
Megan J. Bowman ◽  
Jane A. Pulman ◽  
Tiffany L. Liu ◽  
Kevin L. Childs

Accurate structural annotation depends on well-trained gene prediction programs. Training data for gene prediction programs are often chosen randomly from a subset of high-quality genes that ideally represent the variation found within a genome. One aspect of gene variation is GC content, which differs across species and is bimodal in grass genomes. We find that gene prediction programs trained on genes with random GC content do not completely predict all grass genes with extreme GC content. We present a new GC-specific MAKER annotation protocol to predict new and improved gene models and assess the biological significance of this method in Oryza sativa.

