FILER: a framework for harmonizing and querying large-scale functional genomics knowledge

ABSTRACT Querying massive functional genomic and annotation data collections, linking and summarizing the query results across data sources/data types are important steps in high-throughput genomic and genetic analytical workflows. However, these steps are made difficult by the heterogeneity and breadth of data sources, experimental assays, biological conditions/tissues/cell types and file formats. FILER (FunctIonaL gEnomics Repository) is a framework for querying large-scale genomics knowledge with a large, curated integrated catalog of harmonized functional genomic and annotation data coupled with a scalable genomic search and querying interface. FILER uniquely provides: (i) streamlined access to >50 000 harmonized, annotated genomic datasets across >20 integrated data sources, >1100 tissues/cell types and >20 experimental assays; (ii) a scalable genomic querying interface; and (iii) ability to analyze and annotate user’s experimental data. This rich resource spans >17 billion GRCh37/hg19 and GRCh38/hg38 genomic records. Our benchmark querying 7 × 10⁹ hg19 FILER records shows FILER is highly scalable, with a sub-linear 32-fold increase in querying time when increasing the number of queries 1000-fold from 1000 to 1 000 000 intervals. Together, these features facilitate reproducible research and streamline integrating/querying large-scale genomic data within analyses/workflows. FILER can be deployed on cloud or local servers (https://bitbucket.org/wanglab-upenn/FILER) for integration with custom pipelines and is freely available (https://lisanwanglab.org/FILER).
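
The kind of interval-overlap lookup that a genomic querying interface performs can be illustrated with a minimal sketch. It assumes half-open, BED-style coordinates and a small in-memory sorted list. The class name `IntervalIndex`, its method `overlaps`, and the example records are hypothetical and are not FILER's actual implementation or API.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class IntervalIndex:
    """Hypothetical in-memory index of half-open [start, end) intervals per chromosome."""

    def __init__(self, records):
        # records: iterable of (chrom, start, end, label) tuples
        by_chrom = defaultdict(list)
        for chrom, start, end, label in records:
            by_chrom[chrom].append((start, end, label))
        self._intervals, self._starts, self._max_len = {}, {}, {}
        for chrom, items in by_chrom.items():
            items.sort(key=lambda iv: (iv[0], iv[1]))
            self._intervals[chrom] = items
            self._starts[chrom] = [s for s, _, _ in items]
            self._max_len[chrom] = max(e - s for s, e, _ in items)

    def overlaps(self, chrom, q_start, q_end):
        """Return all indexed intervals that overlap [q_start, q_end)."""
        if chrom not in self._intervals:
            return []
        starts = self._starts[chrom]
        # Overlapping intervals must start before the query ends ...
        hi = bisect_left(starts, q_end)
        # ... and must start after q_start - max_len to be able to reach q_start.
        lo = bisect_right(starts, q_start - self._max_len[chrom])
        return [iv for iv in self._intervals[chrom][lo:hi] if iv[1] > q_start]


# Invented example records, for illustration only.
records = [
    ("chr1", 100, 200, "enhancer_A"),
    ("chr1", 150, 400, "peak_B"),
    ("chr1", 500, 600, "peak_C"),
    ("chr2", 100, 200, "enhancer_D"),
]
index = IntervalIndex(records)
hits = index.overlaps("chr1", 200, 500)
```

Each lookup uses two binary searches followed by a scan of the candidate window, so the per-query cost depends on the window size rather than on the total number of records. This is only one of several possible indexing schemes.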

FILER: large-scale, harmonized FunctIonaL gEnomics Repository

doi:10.1101/2021.01.22.427681, 2021

Author(s): Pavel P. Kuksa, Prabhakaran Gangadharan, Zivadin Katanic, Lauren Kleidermacher, Alexandre Amlie-Wolf, ...

Keyword(s): Functional Genomics, Large Scale, Cell Types, Data Sources, Reproducible Research, Functional Genomic, Data Types, Link Type, Annotation Data, Large Scale Analysis

Abstract Motivation: Querying massive collections of functional genomic and annotation data, linking and summarizing the query results across data sources and data types are important steps in high-throughput genomic and genetic analytical workflows. However, accomplishing these steps is difficult because of the heterogeneity and breadth of data sources, experimental assays, biological conditions (e.g., tissues, cell types), data types, and file formats. Results: FunctIonaL gEnomics Repository (FILER) is a large-scale, harmonized functional genomics data catalog uniquely providing: 1) streamlined access to >50,000 harmonized, annotated functional genomic and annotation datasets across >20 integrated data sources, >1,100 biological conditions/tissues/cell types, and >20 experimental assays; 2) a scalable, indexing-based genomic querying interface; 3) ability for users to analyze and annotate their own experimental data against reference datasets. This rich resource spans >17 billion genomic records for both GRCh37/hg19 and GRCh38/hg38 genome builds. FILER scales well with the experimental (query) data size and the number of reference datasets and data sources. When evaluated on large-scale analysis tasks, FILER demonstrated great efficiency as the observed running time for querying 1000x more genomic intervals (10⁶ vs. 10³) against all 7×10⁹ hg19 FILER records increased sub-linearly by only a factor of 15x. Together, these features facilitate reproducible research and streamline querying, integrating, and utilizing large-scale functional genomics and annotation data. Availability and implementation: FILER can be 1) freely accessed at https://lisanwanglab.org/FILER, 2) deployed on cloud or local servers (https://bitbucket.org/wanglab-upenn/FILER), and 3) integrated with other pipelines using provided scripts.
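
To make "sub-linear" concrete, the short calculation below derives the scaling exponent implied by the two figures in this abstract (1000x more queries, 15x more time). It assumes, purely for illustration, that running time follows a power law in the number of queries. The abstract reports only two data points, so this is a rough reading rather than a fitted model, and the function name `scaling_exponent` is hypothetical.

```python
import math


def scaling_exponent(query_fold: float, time_fold: float) -> float:
    """Exponent k in time ∝ queries**k implied by two fold-changes (power-law assumption)."""
    return math.log(time_fold) / math.log(query_fold)


# Fold-changes taken from the abstract: 1000x more intervals, 15x longer running time.
k = scaling_exponent(1000, 15)
print(f"implied exponent: {k:.3f}")  # about 0.392; linear scaling would give 1.0
```

An exponent below 1 means that the cost per query falls as the batch grows.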

Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics

Nucleic Acids Research, doi:10.1093/nar/gkaa840, 2020, Vol 49 (D1), pp. D1311-D1320, Cited By ~ 1

Author(s): Maya Ghoussaini, Edward Mountjoy, Miguel Carmona, Gareth Peat, Ellen M Schmidt, ...

Keyword(s): Functional Genomics, Large Scale, Drug Repurposing, Cell Types, Chromatin Interaction, The Public, Wide Range, Causal Genes, Causal Variants, Systematic Identification

Abstract Open Targets Genetics (https://genetics.opentargets.org) is an open-access integrative resource that aggregates human GWAS and functional genomics data including gene expression, protein abundance, chromatin interaction and conformation data from a wide range of cell types and tissues to make robust connections between GWAS-associated loci, variants and likely causal genes. This enables systematic identification and prioritisation of likely causal variants and genes across all published trait-associated loci. In this paper, we describe the public resources we aggregate, the technology and analyses we use, and the functionality that the portal offers. Open Targets Genetics can be searched by variant, gene or study/phenotype. It offers tools that enable users to prioritise causal variants and genes at disease-associated loci and access systematic cross-disease and disease-molecular trait colocalization analysis across 92 cell types and tissues including the eQTL Catalogue. Data visualizations such as Manhattan-like plots, regional plots, credible sets overlap between studies and PheWAS plots enable users to explore GWAS signals in depth. The integrated data is made available through the web portal, for bulk download and via a GraphQL API, and the software is open source. Applications of this integrated data include identification of novel targets for drug discovery and drug repurposing.

Learning a genome-wide score of human-mouse conservation at the functional genomics level

doi:10.1101/2020.09.08.288092, 2020

Author(s): Soo Bin Kwon, Jason Ernst

Keyword(s): Mouse Model, Functional Genomics, Large Scale, Computational Method, Functional Genomic, Model Studies, Novel Approach, Genome Wide, A Genome, Human And Mouse

Abstract Identifying genomic regions with functional genomic properties that are conserved between human and mouse is an important challenge in the context of mouse model studies. To address this, we take a novel approach and learn a score of evidence of conservation at the functional genomics level by integrating large-scale information in a compendium of epigenomic, transcription factor binding, and transcriptomic data from human and mouse. The computational method we developed to do this, Learning Evidence of Conservation from Integrated Functional genomic annotations (LECIF), trains a neural network, which is then used to generate a genome-wide score in human and mouse. The resulting LECIF score highlights human and mouse regions with shared functional genomic properties and captures correspondence of biologically similar human and mouse annotations even though it was not explicitly given such information. LECIF will be a resource for mouse model studies.

Data integration of structured and unstructured sources for assigning clinical codes to patient stays

Journal of the American Medical Informatics Association, doi:10.1093/jamia/ocv115, 2015, Vol 23 (e1), pp. e11-e19, Cited By ~ 15

Author(s): Elyne Scheurwegs, Kim Luyckx, Léon Luyten, Walter Daelemans, Tim Van den Bulcke

Keyword(s): Data Integration, Large Scale, Heterogeneous Data, Early Data, Data Sources, Data Types, Late Data, Electronic Health Record Data, Medical Specialties, Electronic Health

Abstract Objective: Enormous amounts of healthcare data are becoming increasingly accessible through the large-scale adoption of electronic health records. In this work, structured and unstructured (textual) data are combined to assign clinical diagnostic and procedural codes (specifically ICD-9-CM) to patient stays. We investigate whether integrating these heterogeneous data types improves prediction strength compared to using the data types in isolation. Methods: Two separate data integration approaches were evaluated. Early data integration combines features of several sources within a single model, and late data integration learns a separate model per data source and combines these predictions with a meta-learner. This is evaluated on data sources and clinical codes from a broad set of medical specialties. Results: When compared with the best individual prediction source, late data integration leads to improvements in predictive power (e.g., overall F-measure increased from 30.6% to 38.3% for International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) diagnostic codes), while early data integration is less consistent. The predictive strength strongly differs between medical specialties, both for ICD-9-CM diagnostic and procedural codes. Discussion: Structured data provides complementary information to unstructured data (and vice versa) for predicting ICD-9-CM codes. This can be captured most effectively by the proposed late data integration approach. Conclusions: We demonstrated that models using multiple electronic health record data sources systematically outperform models using data sources in isolation in the task of predicting ICD-9-CM codes over a broad range of medical specialties.
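
The difference between early and late data integration described in this abstract can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' method: a tiny hand-written logistic regression stands in for whatever classifiers the study used, the toy feature values and labels are invented, and all function and variable names are hypothetical. For brevity the meta-learner is trained on in-sample predictions, whereas a careful stacking setup would use held-out predictions.

```python
import math


def predict_proba(model, x):
    """Probability of the positive class under a logistic model (weights, bias)."""
    weights, bias = model
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent (stand-in for any classifier)."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, label in zip(X, y):
            err = predict_proba((weights, bias), x) - label
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


# Invented toy data: one row per patient stay, label 1 = clinical code assigned.
structured = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]  # e.g. structured record features
text = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]        # e.g. bag-of-words features
labels = [1, 1, 0, 0]

# Early integration: concatenate the features of all sources and train a single model.
early_X = [s + t for s, t in zip(structured, text)]
early_model = train_logreg(early_X, labels)
early_pred = [round(predict_proba(early_model, x)) for x in early_X]

# Late integration: one model per source, then a meta-learner over their predictions.
model_s = train_logreg(structured, labels)
model_t = train_logreg(text, labels)
meta_X = [[predict_proba(model_s, s), predict_proba(model_t, t)]
          for s, t in zip(structured, text)]
meta_model = train_logreg(meta_X, labels)
late_pred = [round(predict_proba(meta_model, x)) for x in meta_X]
```

On this separable toy data both strategies recover the labels. The example shows only where the sources are combined (at the feature level versus at the prediction level) and says nothing about which strategy performs better in practice.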

Faculty Opinions recommendation of A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature, doi:10.3410/f.718350670.793493808, 2014

Author(s): Zdeněk Valenta

Keyword(s): Functional Genomics, Covariance Matrix, Large Scale, Covariance Matrix Estimation, Matrix Estimation, Scale Covariance

Contribution of Ionotropic Glutamatergic Receptors to Excitability and Attentional Signals in Macaque Frontal Eye Field

Cerebral Cortex, doi:10.1093/cercor/bhab007, 2021

Author(s): Miguel Dasilva, Christian Brandt, Marc Alwin Gieselmann, Claudia Distler, Alexander Thiele

Keyword(s): Attentional Control, Large Scale, Neuronal Excitability, Cell Types, Receptor Activation, Frontal Eye Field, Control Signals, Cognitive Operations, Eye Field, Large Scale Networks

Abstract Top-down attention, controlled by frontal cortical areas, is a key component of cognitive operations. How different neurotransmitters and neuromodulators flexibly change the cellular and network interactions with attention demands remains poorly understood. While acetylcholine and dopamine are critically involved, glutamatergic receptors have been proposed to play important roles. To understand their contribution to attentional signals, we investigated how ionotropic glutamatergic receptors in the frontal eye field (FEF) of male macaques contribute to neuronal excitability and attentional control signals in different cell types. Broad-spiking and narrow-spiking cells both required N-methyl-D-aspartic acid and α-amino-3-hydroxy-5-methyl-4-isoxazolepropionic acid receptor activation for normal excitability, thereby affecting ongoing or stimulus-driven activity. However, attentional control signals were not dependent on either glutamatergic receptor type in broad- or narrow-spiking cells. A further subdivision of cell types into different functional types using cluster-analysis based on spike waveforms and spiking characteristics did not change the conclusions. This can be explained by a model where local blockade of specific ionotropic receptors is compensated by cell embedding in large-scale networks. It sets the glutamatergic system apart from the cholinergic system in FEF and demonstrates that a reduction in excitability is not sufficient to induce a reduction in attentional control signals.

Transcriptional and morphological profiling of parvalbumin interneuron subpopulations in the mouse hippocampus

Nature Communications, doi:10.1038/s41467-020-20328-4, 2021, Vol 12 (1)

Author(s): Lin Que, David Lukacsovich, Wenshu Luo, Csaba Földy

Keyword(s): Large Scale, Cell Types, Rna Seq, Neuronal Identity, Parvalbumin Interneurons, Different Types, Parvalbumin Interneuron, Cam Profile, Developmental Domains

Abstract The diversity reflected by >100 different neural cell types fundamentally contributes to brain function and a central idea is that neuronal identity can be inferred from genetic information. Recent large-scale transcriptomic assays seem to confirm this hypothesis, but a lack of morphological information has limited the identification of several known cell types. In this study, we used single-cell RNA-seq in morphologically identified parvalbumin interneurons (PV-INs), and studied their transcriptomic states in the morphological, physiological, and developmental domains. Overall, we find high transcriptomic similarity among PV-INs, with few genes showing divergent expression between morphologically different types. Furthermore, PV-INs show a uniform synaptic cell adhesion molecule (CAM) profile, suggesting that CAM expression in mature PV cells does not reflect wiring specificity after development. Together, our results suggest that while PV-INs differ in anatomy and in vivo activity, their continuous transcriptomic and homogenous biophysical landscapes are not predictive of these distinct identities.

A Large-Scale COVID-19 Twitter Chatter Dataset for Open Scientific Research—An International Collaboration

Epidemiologia, doi:10.3390/epidemiologia2030024, 2021, Vol 2 (3), pp. 315-324

Author(s): Juan M. Banda, Ramya Tekumalla, Guanyu Wang, Jingyuan Yu, Tuo Liu, ...

Keyword(s): Large Scale, Social Dynamics, Additional Data, Open Data, Data Sources, Research Projects, Research Groups, The World, Data Source

As the COVID-19 pandemic continues to spread worldwide, an unprecedented amount of open data is being generated for medical, genetics, and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and data generated on the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and measure the role of social dynamics of such a unique worldwide event in biomedical, biological, and epidemiological analyses. For this purpose, we present a large-scale curated dataset of over 1.12 billion tweets, growing daily, related to COVID-19 chatter generated from 1 January 2020 to 27 June 2021 at the time of writing. This data source provides a freely available additional data source for researchers worldwide to conduct a wide and diverse number of research projects, such as epidemiological analyses, emotional and mental responses to social distancing measures, the identification of sources of misinformation, stratified measurement of sentiment towards the pandemic in near real time, among many others.

Correcting for experiment-specific variability in expression compendia can remove underlying signals

GigaScience, doi:10.1093/gigascience/giaa117, 2020, Vol 9 (11)

Author(s): Alexandra J Lee, YoSon Park, Georgia Doing, Deborah A Hogan, Casey S Greene

Keyword(s): Gene Expression, Large Scale, Original Signal, Batch Effects, Technical Variability, The Past, Statistical Correction, Before And After, Data Collections, Biological Patterns

Abstract Motivation: In the past two decades, scientists in different laboratories have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, or "batch effects," may result from combining samples collected and processed at different times and in different settings. Such variability may distort our ability to extract true underlying biological patterns. As more integrative analysis methods arise and data collections get bigger, we must determine how technical variability affects our ability to detect desired patterns when many experiments are combined. Objective: We sought to determine the extent to which an underlying signal was masked by technical variability by simulating compendia comprising data aggregated across multiple experiments. Method: We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability. Results: The signal from a baseline compendium was obscured when the number of added sources of variability was small. Applying statistical correction methods rescued the underlying signal in these cases. However, as the number of sources of variability increased, it became easier to detect the original signal even without correction. In fact, statistical correction reduced our power to detect the underlying signal. Conclusion: When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces our ability to extract underlying patterns.

Learning a genome-wide score of human–mouse conservation at the functional genomics level

Nature Communications, doi:10.1038/s41467-021-22653-8, 2021, Vol 12 (1)

Author(s): Soo Bin Kwon, Jason Ernst

Keyword(s): Mouse Model, Functional Genomics, Functional Genomic, Transcriptomic Data, Model Studies, Genome Wide, A Genome, Important Challenge, Genomic Regions, Human And Mouse

Abstract Identifying genomic regions with functional genomic properties that are conserved between human and mouse is an important challenge in the context of mouse model studies. To address this, we develop a method to learn a score of evidence of conservation at the functional genomics level by integrating information from a compendium of epigenomic, transcription factor binding, and transcriptomic data from human and mouse. The method, Learning Evidence of Conservation from Integrated Functional genomic annotations (LECIF), trains neural networks to generate this score for the human and mouse genomes. The resulting LECIF score highlights human and mouse regions with shared functional genomic properties and captures correspondence of biologically similar human and mouse annotations. Analysis with independent datasets shows the score also highlights loci associated with similar phenotypes in both species. LECIF will be a resource for mouse model studies by identifying loci whose functional genomic properties are likely conserved.