Crowdsourcing biocuration: the Community Assessment of Community Annotation with Ontologies (CACAO)

Experimental data about known gene functions curated from the primary literature have enormous value for research scientists in understanding biology. Using the Gene Ontology (GO), manual curation by experts has provided an important resource for studying gene function, especially within model organisms. Unprecedented expansion of the scientific literature and validation of the predicted proteins have increased both data value and the challenges of keeping pace. Capturing literature-based functional annotations is limited by the ability of biocurators to handle the massive and rapidly growing scientific literature. Within the community-oriented wiki framework for GO annotation called the Gene Ontology Normal Usage Tracking System (GONUTS), we describe an approach to expand biocuration through crowdsourcing with undergraduates. This multiplies the number of high-quality annotations in international databases, enriches our coverage of the literature on normal gene function, and pushes the field in new directions. From an intercollegiate competition judged by experienced biocurators, Community Assessment of Community Annotation with Ontologies (CACAO), we have contributed nearly 5000 literature-based annotations. Many of those annotations are to organisms not currently well-represented within GO. Over a ten-year history, our community contributors have spurred changes to the ontology not traditionally covered by professional biocurators. The CACAO principle of relying on community members to participate in and shape the future of biocuration in GO is a powerful and scalable model used to promote the scientific enterprise. It also provides undergraduate students with a unique and enriching introduction to critical reading of primary literature and acquisition of marketable skills.

Significance Statement: The primary scientific literature catalogs the results from publicly funded scientific research about gene function in human-readable format. Information captured from those studies in a widely adopted, machine-readable standard format comes in the form of Gene Ontology annotations about gene functions from all domains of life. Manual annotations based on inferences directly from the scientific literature, including the evidence used to make such inferences, represent the best return on investment by improving data accessibility across the biological sciences. To supplement professional curation, our CACAO project enabled annotation of the scientific literature by community annotators, in this case undergraduates, which resulted in contribution of thousands of validated entries to public resources. These annotations are now being used by scientists worldwide.
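
To make the idea of a machine-readable annotation concrete, the following is a minimal sketch of what one literature-based GO annotation might carry: a gene, a GO term, an evidence code, and the supporting reference. The class name, field names, and the example gene and reference identifiers are assumptions made for illustration; they are not the CACAO or GONUTS data model, and the set of codes treated as experimental is a chosen subset rather than the full GO evidence code list.

```python
from dataclasses import dataclass

# Evidence codes treated here as experimental. This subset is an assumption
# chosen for illustration, not the complete GO evidence code list.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


@dataclass(frozen=True)
class GoAnnotation:
    gene_id: str        # identifier of the annotated gene product
    go_term: str        # GO term identifier, e.g. "GO:0003677"
    evidence_code: str  # how the inference was made, e.g. "IDA"
    reference: str      # publication the inference was drawn from

    def is_experimental(self) -> bool:
        """True when the annotation rests on an experimental evidence code."""
        return self.evidence_code in EXPERIMENTAL_CODES


# Hypothetical record: the gene identifier and reference are placeholders.
example = GoAnnotation("EXAMPLE:gene1", "GO:0003677", "IDA", "PMID:0000000")
```

Keeping the evidence code and reference alongside the term is what lets a reviewer trace an annotation back to the experiment that supports it.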

Alliance of Genome Resources Portal: unified model organism research platform

Nucleic Acids Research ◽

10.1093/nar/gkz813 ◽

2019 ◽

Vol 48 (D1) ◽

pp. D650-D658 ◽

Cited By ~ 36

Author(s):

Julie Agapite ◽

Laurent-Philippe Albou ◽

Suzi Aleksander ◽

Joanna Argasinska ◽

...

Keyword(s):

Gene Ontology ◽

Model Organism ◽

Model Organisms ◽

Data Types ◽

Primary Model ◽

Genomic Studies ◽

Health And Disease ◽

Extensive Body ◽

Access To Data ◽

Model Organism Databases

Abstract The Alliance of Genome Resources (Alliance) is a consortium of the major model organism databases and the Gene Ontology that is guided by the vision of facilitating exploration of related genes in human and well-studied model organisms by providing a highly integrated and comprehensive platform that enables researchers to leverage the extensive body of genetic and genomic studies in these organisms. Initiated in 2016, the Alliance is building a central portal (www.alliancegenome.org) for access to data for the primary model organisms along with gene ontology data and human data. All data types represented in the Alliance portal (e.g. genomic data and phenotype descriptions) have common data models and workflows for curation. All data are open and freely available via a variety of mechanisms. Long-term plans for the Alliance project include a focus on coverage of additional model organisms including those without dedicated curation communities, and the inclusion of new data types with a particular focus on providing data and tools for the non-model-organism researcher that support enhanced discovery about human health and disease. Here we review current progress and present immediate plans for this new bioinformatics resource.

Hidden in plain sight: What remains to be discovered in the eukaryotic proteome?

10.1101/469569 ◽

2018 ◽

Author(s):

Valerie Wood ◽

Antonia Lock ◽

Midori A. Harris ◽

Kim Rutherford ◽

Jürg Bähler ◽

...

Keyword(s):

Gene Ontology ◽

Fission Yeast ◽

Genome Sequencing ◽

Large Scale ◽

Biological Process ◽

Blind Spot ◽

Model Organisms ◽

Biological Processes ◽

Health And Disease

Abstract The first decade of genome sequencing stimulated an explosion in the characterization of unknown proteins. More recently, the pace of functional discovery has slowed, leaving around 20% of the proteins even in well-studied model organisms without informative descriptions of their biological roles. Remarkably, many uncharacterized proteins are conserved from yeasts to human, suggesting that they contribute to fundamental biological processes. To fully understand biological systems in health and disease, we need to account for every part of the system. Unstudied proteins thus represent a collective blind spot that limits the progress of both basic and applied biosciences. We use a simple yet powerful metric based on Gene Ontology (GO) biological process terms to define characterized and uncharacterized proteins for human, budding yeast, and fission yeast. We then identify a set of conserved but unstudied proteins in S. pombe, and classify them based on a combination of orthogonal attributes determined by large-scale experimental and comparative methods. Finally, we explore possible reasons why these proteins remain neglected, and propose courses of action to raise their profile and thereby reap the benefits of completing the catalog of proteins’ biological roles.
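
The abstract does not spell out the metric, so the sketch below is only one plausible reading: a protein counts as characterized when it carries at least one GO biological process term other than an uninformative one. The function name, the input shape, and the choice to treat only the root term GO:0008150 (biological_process) as uninformative are assumptions, not the authors' definition.

```python
BP_ROOT = "GO:0008150"  # root "biological_process" term, treated as uninformative


def split_by_characterization(bp_terms_by_protein, uninformative=frozenset({BP_ROOT})):
    """Split proteins into (characterized, uncharacterized) sets.

    bp_terms_by_protein maps a protein identifier to an iterable of GO
    biological process term identifiers annotated to that protein.
    """
    characterized, uncharacterized = set(), set()
    for protein, terms in bp_terms_by_protein.items():
        # A protein is characterized if any informative term remains.
        if set(terms) - uninformative:
            characterized.add(protein)
        else:
            uncharacterized.add(protein)
    return characterized, uncharacterized


# Hypothetical proteins used only to exercise the function.
known, unknown = split_by_characterization({
    "protA": ["GO:0006412"],   # annotated to a specific process
    "protB": ["GO:0008150"],   # root term only
    "protC": [],               # no biological process annotation
})
```

A fuller version would likely add more terms to the uninformative set, which is why it is exposed as a parameter.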

Gene Function Prediction and Functional Network: The Role of Gene Ontology

Intelligent Systems Reference Library - Data Mining: Foundations and Intelligent Paradigms ◽

10.1007/978-3-642-23151-3_7 ◽

2012 ◽

pp. 123-162 ◽

Cited By ~ 1

Author(s):

Erliang Zeng ◽

Chris Ding ◽

Kalai Mathee ◽

Lisa Schneper ◽

Giri Narasimhan

Keyword(s):

Gene Ontology ◽

Gene Function ◽

Function Prediction ◽

Functional Network ◽

Gene Function Prediction

A Universal, Genomewide GuideFinder for CRISPR/Cas9 Targeting in Microbial Genomes

mSphere ◽

10.1128/msphere.00086-20 ◽

2020 ◽

Vol 5 (1) ◽

Author(s):

Michelle Spoto ◽

Changhui Guan ◽

Elizabeth Fleming ◽

Julia Oh

Keyword(s):

Gene Function ◽

Large Scale ◽

Essential Gene ◽

Bacterial Species ◽

Bacterial Genome ◽

Model Organisms ◽

Design Parameters ◽

Bacterial Genomes ◽

Wide Range ◽

User Friendly

ABSTRACT The CRISPR/Cas system has significant potential to facilitate gene editing in a variety of bacterial species. CRISPR interference (CRISPRi) and CRISPR activation (CRISPRa) represent modifications of the CRISPR/Cas9 system utilizing a catalytically inactive Cas9 protein for transcription repression and activation, respectively. While CRISPRi and CRISPRa have tremendous potential to systematically investigate gene function in bacteria, few programs are specifically tailored to identify guides in draft bacterial genomes genomewide. Furthermore, few programs offer open-source code with flexible design parameters for bacterial targeting. To address these limitations, we created GuideFinder, a customizable, user-friendly program that can design guides for any annotated bacterial genome. GuideFinder designs guides from NGG protospacer-adjacent motif (PAM) sites for any number of genes by the use of an annotated genome and FASTA file input by the user. Guides are filtered according to user-defined design parameters and removed if they contain any off-target matches. Iteration with lowered parameter thresholds allows the program to design guides for genes that did not produce guides with the more stringent parameters, one of several features unique to GuideFinder. GuideFinder can also identify paired guides for targeting multiplicity, whose validity we tested experimentally. GuideFinder has been tested on a variety of diverse bacterial genomes, finding guides for 95% of genes on average. Moreover, guides designed by the program are functionally useful—focusing on CRISPRi as a potential application—as demonstrated by essential gene knockdown in two staphylococcal species. Through the large-scale generation of guides, this open-access software will improve accessibility to CRISPR/Cas studies of a variety of bacterial species. 
IMPORTANCE With the explosion in our understanding of human and environmental microbial diversity, corresponding efforts to understand gene function in these organisms are strongly needed. CRISPR/Cas9 technology has revolutionized interrogation of gene function in a wide variety of model organisms. Efficient CRISPR guide design is required for systematic gene targeting. However, existing tools are not adapted for the broad needs of microbial targeting, which include extraordinary species and subspecies genetic diversity, the overwhelming majority of which is characterized by draft genomes. In addition, flexibility in guide design parameters is important to consider the wide range of factors that can affect guide efficacy, many of which can be species and strain specific. We designed GuideFinder, a customizable, user-friendly program that addresses the limitations of existing software and that can design guides for any annotated bacterial genome with numerous features that facilitate guide design in a wide variety of microorganisms.
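
The design loop described in the abstract (find NGG PAM sites, filter candidate guides by design parameters, retry with lowered thresholds) can be sketched roughly as follows. This is not GuideFinder's code: the function names, the 20-nucleotide guide length, the GC-content filter, and the threshold values are all assumptions chosen for illustration, only the forward strand is scanned, and the off-target check is left out.

```python
import re


def find_ngg_guides(sequence, guide_length=20):
    """Return (start, guide, pam) for every forward-strand NGG PAM site."""
    sequence = sequence.upper()
    guides = []
    # A lookahead is used so that overlapping PAM sites are all reported.
    for match in re.finditer(r"(?=([ACGT]GG))", sequence):
        pam_start = match.start()
        guide_start = pam_start - guide_length
        if guide_start < 0:
            continue  # not enough sequence upstream of the PAM
        guides.append((guide_start, sequence[guide_start:pam_start], match.group(1)))
    return guides


def gc_fraction(guide):
    """Fraction of G and C bases in the guide."""
    return (guide.count("G") + guide.count("C")) / len(guide)


def design_guides(sequence, gc_windows=((0.40, 0.60), (0.30, 0.70))):
    """Try each GC window in turn, relaxing it until some guide passes."""
    candidates = find_ngg_guides(sequence)
    for low, high in gc_windows:
        passing = [g for g in candidates if low <= gc_fraction(g[1]) <= high]
        if passing:
            return passing
    return []
```

Ordering the windows from strict to relaxed mirrors the iteration idea in the abstract: a gene that yields no guide under the stringent setting gets another attempt under a looser one.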

Expert–Novice Comparison Reveals Pedagogical Implications for Students’ Analysis of Primary Literature

CBE—Life Sciences Education ◽

10.1187/cbe.18-05-0077 ◽

2019 ◽

Vol 18 (4) ◽

pp. ar56 ◽

Cited By ~ 1

Author(s):

April A. Nelms ◽

Miriam Segura-Totten

Keyword(s):

Cognitive Load ◽

Scientific Literacy ◽

Cognitive Load Theory ◽

Scientific Literature ◽

Data Interpretation ◽

Scientific Article ◽

Load Theory ◽

Research Article ◽

Primary Literature ◽

Expert Novice

Student engagement in the analysis of primary scientific literature increases critical thinking, scientific literacy, data evaluation, and science process skills. However, little is known about the process by which expertise in reading scientific articles develops. For this reason, we decided to compare how faculty experts and student novices engage with a research article. We performed think-aloud interviews of biology faculty and undergraduates as they read through a scientific article. We analyzed these interviews using qualitative methods. We grounded data interpretation in cognitive load theory and the ICAP (interactive, constructive, active, and passive) framework. Our results revealed that faculty have more complex schemas than students and that they reduce cognitive load through two main mechanisms: summarizing and note-taking. Faculty also engage with articles at a higher cognitive level, described as constructive by the ICAP framework, when compared with students. More complex schemas, effectively lowering cognitive load, and deeper engagement with the text may help explain why faculty encounter fewer comprehension difficulties than students in our study. Finally, faculty analyze and evaluate data more often than students when reading the text. Findings include a discussion of successful pedagogical approaches for instructors wishing to enhance undergraduates’ comprehension and analysis of research articles.

Considering RNAi experimental design in parasitic helminths

Parasitology ◽

10.1017/s0031182011001946 ◽

2012 ◽

Vol 139 (5) ◽

pp. 589-604 ◽

Cited By ~ 25

Author(s):

JOHNATHAN J. DALZELL ◽

NEIL D. WARNOCK ◽

PAUL MCVEIGH ◽

NIKKI J. MARKS ◽

ANGELA MOUSLEY ◽

...

Keyword(s):

Rna Interference ◽

Experimental Design ◽

Gene Function ◽

Experimental Validation ◽

Parasite Species ◽

Model Organisms ◽

Helminth Parasites ◽

Parasitic Helminth ◽

Parasitic Helminths ◽

Parasite Biology

SUMMARY Almost a decade has passed since the first report of RNA interference (RNAi) in a parasitic helminth. Whilst much progress has been made with RNAi informing gene function studies in disparate nematode and flatworm parasites, substantial and seemingly prohibitive difficulties have been encountered in some species, hindering progress. An appraisal of current practices, trends and ideals of RNAi experimental design in parasitic helminths is both timely and necessary for a number of reasons: firstly, the increasing availability of parasitic helminth genome/transcriptome resources means there is a growing need for gene function tools such as RNAi; secondly, fundamental differences and unique challenges exist for parasite species which do not apply to model organisms; thirdly, the inherent variation in experimental design, and reported difficulties with reproducibility undermine confidence. Ideally, RNAi studies of gene function should adopt standardised experimental design to aid reproducibility, interpretation and comparative analyses. Although the huge variations in parasite biology and experimental endpoints make RNAi experimental design standardization difficult or impractical, we must strive to validate RNAi experimentation in helminth parasites. To aid this process we identify multiple approaches to RNAi experimental validation and highlight those which we deem to be critical for gene function studies in helminth parasites.

Gene function prediction with knowledge from gene ontology

International Journal of Data Mining and Bioinformatics ◽

10.1504/ijdmb.2015.070840 ◽

2015 ◽

Vol 13 (1) ◽

pp. 50

Author(s):

Ying Shen ◽

Lin Zhang

Keyword(s):

Gene Ontology ◽

Gene Function ◽

Function Prediction ◽

Gene Function Prediction

Patterns of diverse gene functions in genomic neighborhoods predict gene function and phenotype

10.1101/582577 ◽

2019 ◽

Author(s):

Matej Mihelčić ◽

Tomislav Šmuc ◽

Fran Supek

Keyword(s):

Gene Function ◽

Structural Variation ◽

Predictive Power ◽

State Of The Art ◽

Gene Clustering ◽

High Confidence ◽

Gene Functions ◽

Predict Gene Function ◽

Guilt By Association ◽

Clustering Patterns

Abstract Genes with similar roles in the cell are known to cluster on chromosomes, thus benefiting from coordinated regulation. This allows gene function to be inferred by transferring annotations from genomic neighbors, following the guilt-by-association principle. We performed a systematic search for co-occurrence of >1000 gene functions in genomic neighborhoods across 1669 prokaryotic, 49 fungal and 80 metazoan genomes, revealing prevalent patterns that cannot be explained by clustering of functionally similar genes. It is a very common occurrence that pairs of dissimilar gene functions, corresponding to semantically distant Gene Ontology terms, are significantly co-located on chromosomes. These neighborhood associations are often as conserved across genomes as the known associations between similar functions, suggesting selective benefits from clustering of certain diverse functions, which may conceivably play complementary roles in the cell. We propose a simple encoding of chromosomal gene order, the neighborhood function profiles (NFP), which draws on diverse gene clustering patterns to predict gene function and phenotype. NFPs yield a 26-46% increase in predictive power over state-of-the-art approaches that propagate function across neighborhoods, thus providing hundreds of novel, high-confidence gene function inferences per genome. Furthermore, we demonstrate that the effect of structural variation on gene function distribution across chromosomes may be used to predict phenotype of individuals from their genome sequence.
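
The abstract does not give the exact NFP encoding, so the following is only a guess at its general shape: for each gene, tally the functions annotated to the genes within a fixed window on either side of it along the chromosome. The function name, the window size, and the use of plain counts are assumptions; the authors' encoding may differ in all three.

```python
from collections import Counter


def neighborhood_profiles(gene_order, functions_by_gene, window=2):
    """Map each gene to a Counter of functions found among its neighbors.

    gene_order is a list of gene identifiers in chromosomal order;
    functions_by_gene maps a gene identifier to its annotated functions.
    """
    profiles = {}
    for i, gene in enumerate(gene_order):
        lo = max(0, i - window)
        # Neighbors on both sides, excluding the gene itself.
        neighbors = gene_order[lo:i] + gene_order[i + 1:i + 1 + window]
        profile = Counter()
        for neighbor in neighbors:
            profile.update(functions_by_gene.get(neighbor, ()))
        profiles[gene] = profile
    return profiles


# Hypothetical three-gene chromosome with placeholder function labels.
profiles = neighborhood_profiles(
    ["g1", "g2", "g3"],
    {"g1": {"F1"}, "g2": {"F2"}, "g3": {"F1", "F3"}},
    window=1,
)
```

Because the profile records every function in the neighborhood rather than only those matching the gene's own, it can capture the co-location of dissimilar functions that the abstract highlights.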

Ten simple rules to facilitate evidence implementation in the environmental sciences

FACETS ◽

10.1139/facets-2020-0021 ◽

2020 ◽

Vol 5 (1) ◽

pp. 642-650

Author(s):

Christopher J. Lortie ◽

Malory Owen

Keyword(s):

Environmental Management ◽

Decision Making ◽

Scientific Literature ◽

Scientific Writing ◽

Environmental Sciences ◽

Relevant Evidence ◽

Primary Literature ◽

Simple Rules ◽

Sustainable Societies ◽

Structure Knowledge

There is a gap between fundamental science and managers. There are many general solutions including the need to better leverage the primary scientific literature for decision-making. Herein, we provide a list of 10 simple rules to support environmental management through better scientific writing and suggest practices for more transparent publications. These rules can also be used as a checklist for reusing the primary literature when searching for relevant evidence in the environmental sciences. We need to better structure knowledge in papers for connections within sustainable societies.

Recent advances in gene function prediction using context-specific coexpression networks in plants

F1000Research ◽

10.12688/f1000research.17207.1 ◽

2019 ◽

Vol 8 ◽

pp. 153 ◽

Cited By ~ 2

Author(s):

Chirag Gupta ◽

Andy Pereira

Keyword(s):

Gene Function ◽

Complex Traits ◽

Large Fraction ◽

Genetic Correlations ◽

Short Review ◽

Cellular Functions ◽

Gene Functions ◽

Plant Genes ◽

Context Specific ◽

Coexpression Networks

Predicting gene functions from genome sequence alone has been difficult, and the functions of a large fraction of plant genes remain unknown. However, leveraging the vast amount of currently available gene expression data has the potential to facilitate our understanding of plant gene functions, especially in determining complex traits. Gene coexpression networks—created by integrating multiple expression datasets—connect genes with similar patterns of expression across multiple conditions. Dense gene communities in such networks, commonly referred to as modules, often indicate that the member genes are functionally related. As such, these modules serve as tools for generating new testable hypotheses, including the prediction of gene function and importance. Recently, we have seen a paradigm shift from the traditional “global” to more defined, context-specific coexpression networks. Such coexpression networks imply genetic correlations in specific biological contexts such as during development or in response to a stress. In this short review, we highlight a few recent studies that attempt to fill the large gaps in our knowledge about cellular functions of plant genes using context-specific coexpression networks.
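
As a rough illustration of how a coexpression network and its modules can be built, the sketch below links genes whose expression profiles are strongly correlated and reports connected components as modules. The Pearson measure, the 0.9 threshold, and the use of connected components as a stand-in for dense communities are simplifying assumptions; the review covers a range of methods and this is not any one of them. A context-specific network would come from passing in only the samples collected in that context.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def coexpression_modules(expression, threshold=0.9):
    """Return modules as sets of genes connected by strong coexpression."""
    adjacency = {gene: set() for gene in expression}
    for a, b in combinations(expression, 2):
        if pearson(expression[a], expression[b]) >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    modules, seen = [], set()
    for gene in adjacency:
        if gene in seen:
            continue
        # Depth-first walk collects every gene reachable from this one.
        stack, module = [gene], set()
        while stack:
            current = stack.pop()
            if current in module:
                continue
            module.add(current)
            stack.extend(adjacency[current] - module)
        seen |= module
        modules.append(module)
    return modules


# Hypothetical expression values across four samples.
modules = coexpression_modules({
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 3, 2, 1],
})
```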
