A Method to Detect Differential Gene Expression in Cross-Species Hybridization Experiments at Gene and Probe Level

Motivation: Whole genome microarrays are increasingly the method of choice for studying the responses of model organisms to disease, stressors or other stimuli. However, whole genome sequences are available for only some model organisms, and many species still lack one. Cross-species studies, in which arrays developed for one species are used to study gene expression in a closely related species, have been used to address this gap, with some promising results. Current analytical methods filter out probes or genes that show low hybridization activity, but no consensus filtration scheme is available yet. Results: We propose a novel masking procedure that uses currently available target-species sequences to filter out probes, and we study a cross-species data set using this masking procedure together with gene-set analysis. Gene-set analysis evaluates the association of a priori defined gene groups with a phenotype of interest. Two methods, Gene Set Enrichment Analysis (GSEA) and Test of Test Statistics (ToTS), were investigated. The results showed that the masking procedure together with the ToTS method worked well in our data set. We also present results from an alternative way to study cross-species hybridization experiments without masking. We hypothesize that the multi-probe structure of Affymetrix microarrays makes it possible to aggregate the effects of both well-hybridized and poorly hybridized probes to study a group of genes. The principles of gene-set analysis were applied to the probe-level data instead of the gene-level data. The results showed that ToTS can give valuable information and can therefore be used as a powerful technique for analyzing cross-species hybridization experiments. Availability: Software in the form of R code is available at http://anson.ucdavis.edu/~ychen/cross-species.html Supplementary Data: Supplementary data are available at http://anson.ucdavis.edu/~ychen/cross-species.html
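To illustrate the probe-level idea described in this abstract, the following is a minimal sketch, not the authors' R code. It assumes that a "test of test statistics" can be read as a one-sample t-test on per-probe two-sample statistics, which is a simplification. The probe names, the expression values and the helper functions `welch_t` and `set_level_t` are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample Welch t-statistic for a single probe."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def set_level_t(probe_stats):
    """One-sample t-statistic testing whether the mean probe statistic is zero."""
    n = len(probe_stats)
    return mean(probe_stats) / sqrt(variance(probe_stats) / n)


# Hypothetical log-intensities for three probes belonging to one gene group.
treated = {"p1": [8.1, 8.4, 8.3], "p2": [6.9, 7.2, 7.0], "p3": [5.5, 5.9, 5.6]}
control = {"p1": [7.2, 7.5, 7.1], "p2": [6.1, 6.4, 6.2], "p3": [5.4, 5.6, 5.8]}

# Step 1: one statistic per probe, well hybridized or not.
probe_stats = [welch_t(treated[p], control[p]) for p in treated]

# Step 2: aggregate the probe statistics into one statistic for the group.
set_stat = set_level_t(probe_stats)
print(probe_stats, set_stat)
```

In this sketch a weakly responding probe such as `p3` still contributes to the group statistic instead of being filtered out, which is the aggregation behaviour the abstract hypothesizes for unmasked data.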


Enriching the gene set analysis of genome-wide data by incorporating directionality of gene expression and combining statistical hypotheses and methods

Nucleic Acids Research ◽

10.1093/nar/gkt111 ◽

2013 ◽

Vol 41 (8) ◽

pp. 4378-4391 ◽

Cited By ~ 353

Author(s):

Leif Väremo ◽

Jens Nielsen ◽

Intawat Nookaew

Keyword(s):

Gene Expression ◽

Gene Set Analysis ◽

Gene Set ◽

Genome Wide ◽

Genome Wide Data ◽

Statistical Hypotheses


Correction: Time-Course Gene Set Analysis for Longitudinal Gene Expression Data

PLoS Computational Biology ◽

10.1371/journal.pcbi.1004446 ◽

2015 ◽

Vol 11 (8) ◽

pp. e1004446

Author(s):

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Time Course ◽

Gene Set Analysis ◽

Expression Data ◽

Gene Set


Smoothing Gene Expression Data with Network Information Improves Consistency of Regulated Genes

Statistical Applications in Genetics and Molecular Biology ◽

10.2202/1544-6115.1618 ◽

2011 ◽

Vol 10 (1) ◽

Cited By ~ 6

Author(s):

Guro Dørum ◽

Lars Snipen ◽

Margrete Solheim ◽

Solve Saebo

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Networks ◽

Simulated Data ◽

Real Data ◽

Biological Knowledge ◽

Expression Data ◽

Data Set ◽

Gene Set ◽

Network Information

Gene set analysis methods have become a widely used tool for including prior biological knowledge in the statistical analysis of gene expression data. Advantages of these methods include increased sensitivity, easier interpretation and more conformity in the results. However, gene set methods do not employ all the available information about gene relations. Genes are arranged in complex networks where the network distances contain detailed information about inter-gene dependencies. We propose a method that uses gene networks to smooth gene expression data with the aim of reducing the number of false positives and identifying important subnetworks. Gene dependencies are extracted from the network topology and are used to smooth genewise test statistics. To find the optimal degree of smoothing, we propose using a criterion that considers the correlation between the network and the data. The network smoothing is shown to improve the ability to identify important genes in simulated data. Applied to a real data set, the smoothing accentuates parts of the network with a high density of differentially expressed genes.
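The smoothing of genewise test statistics over a network can be sketched as follows. This is an illustrative simplification, not the published method: it assumes that only direct neighbours are used and that a single hypothetical parameter `alpha` sets the degree of smoothing, whereas the abstract describes dependencies derived from network distances and a data-driven choice of that degree. Gene names and values are made up.

```python
def smooth_statistics(stats, neighbors, alpha):
    """Blend each gene's statistic with the mean statistic of its neighbours.

    alpha = 0 leaves the statistics unchanged; alpha = 1 replaces each
    statistic by its neighbourhood mean. Genes without neighbours are kept.
    """
    smoothed = {}
    for gene, t in stats.items():
        nbr_values = [stats[n] for n in neighbors.get(gene, []) if n in stats]
        if nbr_values:
            nbr_mean = sum(nbr_values) / len(nbr_values)
            smoothed[gene] = (1 - alpha) * t + alpha * nbr_mean
        else:
            smoothed[gene] = t
    return smoothed


# Hypothetical genewise statistics and an undirected toy network A-B-C, D isolated.
stats = {"A": 4.0, "B": 3.0, "C": 0.0, "D": 2.0}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}

smoothed = smooth_statistics(stats, neighbors, alpha=0.5)
print(smoothed)
```

Under these assumptions an isolated gene keeps its statistic, while a gene surrounded by strongly scoring neighbours is pulled toward them, which is how smoothing can accentuate dense regions of differential expression.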


BLAT-Based Comparative Analysis for Transposable Elements: BLATCAT

BioMed Research International ◽

10.1155/2014/730814 ◽

2014 ◽

Vol 2014 ◽

pp. 1-7

Author(s):

Sangbum Lee ◽

Sumin Oh ◽

Keunsoo Kang ◽

Kyudong Han

Keyword(s):

Gene Expression ◽

Comparative Analysis ◽

Transposable Elements ◽

Rhesus Macaque ◽

Whole Genome ◽

Genome Sequences ◽

Comparative Analyses ◽

Manual Inspection ◽

Alignment Tool ◽

Specific Locus

The availability of several whole genome sequences makes comparative analyses possible. In primate genomes, the priority of transposable elements (TEs) is significantly increased because they account for ~45% of the primate genomes, they can regulate the gene expression level, and they are associated with genomic fluidity in their host genomes. Here, we developed the BLAST-like alignment tool (BLAT) based comparative analysis for transposable elements (BLATCAT) program. The BLATCAT program can compare specific regions of six representative primate genome sequences (human, chimpanzee, gorilla, orangutan, gibbon, and rhesus macaque) on the basis of BLAT and simultaneously carry out RepeatMasker and/or Censor functions, which are widely used Windows-based web-server functions to detect TEs. All results can be stored as an HTML file for manual inspection of a specific locus. BLATCAT will be very convenient and efficient for comparative analyses of TEs in various primate genomes.


Tandem repeat interval pattern identifies animal taxa

Bioinformatics ◽

10.1093/bioinformatics/btab124 ◽

2021 ◽

Author(s):

Balaram Bhattacharyya ◽

Uddalak Mitra ◽

Ramkishore Bhattacharyya

Keyword(s):

Information Content ◽

Tandem Repeat ◽

Tandem Repeats ◽

Ordered Set ◽

Whole Genome Sequence ◽

Supplementary Information ◽

Whole Genome ◽

Supplementary Data ◽

Genome Sequences ◽

Significant Achievement

Motivation: We find that the maximality of information content among intervals of tandem repeats (TRs) in animal genomes segregates over taxa, so that taxa identification becomes swift and accurate. Successive TRs of a motif occur at intervals over the sequence, forming a trail of TRs of the motif across the genome. We present a method, Tandem Repeat Information Mining (TRIM), that mines the 4^k TR trails of all k-length motifs from a whole genome sequence and extracts the information content within the intervals of the trails. The TRIM vector, formed from the ordered set of interval entropies, becomes instrumental for genome segregation. Results: Reconstruction of the correct phylogeny for animals from whole genome sequences demonstrates the precision of TRIM. Identification of animal taxa by the TRIM vector upon feature selection is the most significant achievement. These results suggest that the Tandem Repeat Interval Pattern (TRIP) is a taxa-specific constitutional characteristic of animal genomes. Availability and implementation: Source and executable code of TRIM, along with a usage manual, are available at https://github.com/BB-BiG/TRIM. Supplementary information: Supplementary data are available at Bioinformatics online.
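The notion of an entropy computed over the intervals of one TR trail can be sketched as below. This is not the TRIM implementation linked above. It assumes that a tandem repeat means at least `min_copies` consecutive copies of the motif, that an interval is the gap between successive TR start positions, and that information content is the Shannon entropy of the observed interval lengths. The function names and the toy sequence are hypothetical.

```python
import re
from collections import Counter
from math import log2


def tandem_repeat_starts(seq, motif, min_copies=2):
    """Start positions of runs of at least `min_copies` copies of `motif`."""
    pattern = re.compile("(?:%s){%d,}" % (re.escape(motif), min_copies))
    return [m.start() for m in pattern.finditer(seq)]


def interval_entropy(starts):
    """Shannon entropy (bits) of the gaps between successive start positions."""
    intervals = [b - a for a, b in zip(starts, starts[1:])]
    if not intervals:
        return 0.0
    counts = Counter(intervals)
    total = len(intervals)
    return -sum(c / total * log2(c / total) for c in counts.values())


# Toy sequence with three runs of the motif "AC".
seq = "ACACGGACACTTTTACACAC"
starts = tandem_repeat_starts(seq, "AC")
entropy = interval_entropy(starts)
print(starts, entropy)
```

Repeating this for every k-length motif and ordering the resulting entropies would give a vector with 4^k entries, in the spirit of the TRIM vector described in the abstract.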


Excitatory/inhibitory imbalance in autism: the role of glutamate and GABA gene-sets in symptoms and cortical brain structure

10.1101/2021.12.20.473501 ◽

2021 ◽

Author(s):

Viola Hollestein ◽

Geert Poelmans ◽

Natalie Forde ◽

Christian F Beckmann ◽

Christine Ecker ◽

...

Keyword(s):

Gene Expression ◽

Sensory Processing ◽

Brain Structure ◽

Symptom Severity ◽

Expression Profiles ◽

Brain Regions ◽

Gene Set Analysis ◽

Gene Set ◽

Gene Sets ◽

Autism Symptomatology

Background: The excitatory/inhibitory (E/I) imbalance hypothesis posits that an imbalance between excitatory (glutamatergic) and inhibitory (GABAergic) mechanisms underlies the behavioral characteristics of autism spectrum disorder (autism). However, how E/I imbalance arises and how it may differ across autism symptomatology and brain regions is not well understood. Methods: We used innovative analysis methods, combining competitive gene-set analysis and gene-expression profiles in relation to cortical thickness (CT), to investigate the relationship between genetic variance, brain structure and autism symptomatology of participants from the EU-AIMS LEAP cohort (autism=360, male/female=259/101; neurotypical control participants=279, male/female=178/101) aged 6 to 30 years. Competitive gene-set analysis investigated associations between glutamatergic and GABAergic signaling pathway gene-sets and clinical measures, and CT. Additionally, we investigated expression profiles of the genes within those sets throughout the brain and how those profiles relate to differences in CT between autistic and neurotypical control participants in the same regions. Results: The glutamate gene-set was associated with all autism symptom severity scores on the Autism Diagnostic Observation Schedule-2 (ADOS-2) and the Autism Diagnostic Interview-Revised (ADI-R) within the autistic group, while the GABA set was associated with sensory processing measures (using the SSP subscales) across all participants. Brain regions with greater gene expression of both glutamate and GABA genes showed greater differences in CT between autistic and neurotypical control participants. Conclusions: Our results suggest crucial roles for glutamate and GABA genes in autism symptomatology as well as CT, where GABA is more strongly associated with sensory processing and glutamate more with autism symptom severity.


Linear Combination Test for Hierarchical Gene Set Analysis

Statistical Applications in Genetics and Molecular Biology ◽

10.2202/1544-6115.1641 ◽

2011 ◽

Vol 10 (1) ◽

Cited By ~ 8

Author(s):

Xiaoming Wang ◽

Irina Dinu ◽

Wei Liu ◽

Yutaka Yasui

Keyword(s):

Gene Expression ◽

Linear Combination ◽

Covariance Matrix ◽

Gene Set Analysis ◽

Gene Set ◽

Hotelling's T2 ◽

Gene Sets ◽

Combination Test ◽

Linear Combination Test

Gene-set analysis (GSA) aims to identify sets of genes that are differentially expressed by phenotype in DNA microarray studies. Challenges arise from the salient characteristics of the data: (1) the number of genes is far larger than the number of observations; (2) gene expression measurements, especially within each gene set, can be highly correlated; and (3) the number of gene sets that can be examined is large and increasing rapidly. These challenges call for gene-set testing procedures that are both computationally efficient for large GSAs and highly powered in the presence of high correlation. We propose a new GSA approach called Linear Combination Test (LCT), which incorporates the covariance matrix estimator of gene expression into the test statistic. The proposed LCT is evaluated in simulation studies and a real microarray study against two other GSA methods: a modification of Hotelling’s T2 using a shrinkage covariance matrix, and our SAM-GS (Dinu et al. 2007). These are the two methods reported by Tsai and Chen (2009) to perform best in terms of power. The LCT method is more computationally efficient than the modified Hotelling’s T2 and approximates the superb power of the modified Hotelling’s T2. LCT is slightly faster than SAM-GS, but more powerful, due to incorporating the covariance matrix estimator. An extra step to enhance the interpretation of GSA results is also proposed in the form of a hierarchical LC (HLC) testing procedure, which gives scientists useful hierarchical information on the gene sets that LCT identified as differentially expressed. Availability: Free R code to perform LCT-GSA and the HLC test is available at http://www.ualberta.ca/~yyasui/homepage.html.


Distance-correlation based gene set analysis in longitudinal studies

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2017-0053 ◽

2018 ◽

Vol 17 (1) ◽

Author(s):

Jiehuan Sun ◽

Jose D. Herazo-Maya ◽

Xiu Huang ◽

Naftali Kaminski ◽

Hongyu Zhao

Keyword(s):

Gene Expression ◽

Disease Progression ◽

Clinical Outcomes ◽

Expression Profiles ◽

Gene Set Analysis ◽

Related Gene ◽

Disease Etiology ◽

Distance Correlation ◽

Gene Set ◽

Gene Sets

Longitudinal gene expression profiles of subjects are collected in some clinical studies to monitor disease progression and understand disease etiology. The identification of gene sets that have coordinated changes with relevant clinical outcomes over time from these data could provide significant insights into the molecular basis of disease progression and lead to better treatments. In this article, we propose a Distance-Correlation based Gene Set Analysis (dcGSA) method for longitudinal gene expression data. dcGSA is a non-parametric, statistically robust approach that can capture both linear and nonlinear relationships between gene sets and clinical outcomes. In addition, dcGSA is able to identify related gene sets in cases where the effects of gene sets on clinical outcomes differ across subjects due to subject heterogeneity, remove the confounding effects of some unobserved time-invariant covariates, and allow the assessment of associations between gene sets and multiple related outcomes simultaneously. Through extensive simulation studies, we demonstrate that dcGSA is more powerful at detecting relevant genes than other commonly used gene set analysis methods. When dcGSA is applied to a real dataset on systemic lupus erythematosus, we are able to identify more disease-related gene sets than other methods.
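The claim that distance correlation captures nonlinear as well as linear relationships can be checked with a small sketch of the standard sample distance correlation (the biased, double-centred estimator). This is only the underlying statistic and not the dcGSA procedure itself, which additionally handles longitudinal structure and subject heterogeneity. The data and variable names are hypothetical.

```python
from math import dist, sqrt


def centered_distances(points):
    """Pairwise Euclidean distance matrix, double-centred by row/column/grand means."""
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation between two lists of equal-length tuples."""
    a, b = centered_distances(x), centered_distances(y)
    n = len(x)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(v * v for row in a for v in row) / n ** 2
    dvar_y = sum(v * v for row in b for v in row) / n ** 2
    if dvar_x * dvar_y == 0:
        return 0.0
    return sqrt(max(dcov2, 0.0) / sqrt(dvar_x * dvar_y))


# Each observation is a tuple, so multi-gene sets or multiple outcomes fit the same API.
x = [(-2.0,), (-1.0,), (0.0,), (1.0,), (2.0,)]
y_linear = [(2 * v[0] + 1,) for v in x]
y_quadratic = [(v[0] ** 2,) for v in x]

linear = distance_correlation(x, y_linear)
nonlinear = distance_correlation(x, y_quadratic)
print(linear, nonlinear)
```

The quadratic outcome has zero Pearson correlation with `x` by symmetry, yet its distance correlation is clearly above zero, which is the property that motivates using this statistic for gene sets with nonlinear outcome relationships.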


TOXPANEL: A Gene-Set Analysis Tool to Assess Liver and Kidney Injuries

Frontiers in Pharmacology ◽

10.3389/fphar.2021.601511 ◽

2021 ◽

Vol 12 ◽

Author(s):

Patric Schyman ◽

Zhen Xu ◽

Valmik Desai ◽

Anders Wallqvist

Keyword(s):

Gene Expression ◽

Kidney Injury ◽

Fold Change ◽

Gene Set Analysis ◽

Analysis Tool ◽

Gene Set ◽

Physiological Range ◽

Gene Sets ◽

Liver And Kidney

Gene-set analysis is commonly used to identify trends in gene expression when cells, tissues, organs, or organisms are subjected to conditions that differ from those within the normal physiological range. However, tools for gene-set analysis to assess liver and kidney injury responses are less common. Furthermore, most websites for gene-set analysis lack the option for users to customize their gene-set database. Here, we present the ToxPanel website, which allows users to perform gene-set analysis to assess liver and kidney injuries using activation scores based on gene-expression fold-change values. The results are graphically presented to assess constituent injury phenotypes (histopathology), with interactive result tables that identify the main contributing genes to a given signal. In addition, ToxPanel offers the flexibility to analyze any set of custom genes based on gene fold-change values. ToxPanel is publicly available online at https://toxpanel.bhsai.org. ToxPanel allows users to access our previously developed liver and kidney injury gene sets, which we have shown in previous work to yield robust results that correlate with the degree of injury. Users can also test and validate their customized gene sets using the ToxPanel website.


CNCDatabase: a database of non-coding cancer drivers

10.1101/2020.04.29.069047 ◽

2020 ◽

Author(s):

Eric Minwei Liu ◽

Alexander Martinez-Fundichely ◽

Rajesh Bollapragada ◽

Maurice Spiewack ◽

Ekta Khurana

Keyword(s):

Gene Expression ◽

Luciferase Reporter ◽

Whole Genome ◽

Gene Promoters ◽

Genome Sequences ◽

Coding Regions ◽

Cancer Types ◽

Cancer Drivers ◽

Non Coding Rnas ◽

Experimental Validations

Most mutations in cancer genomes occur in the non-coding regions, with unknown impact on tumor development. Although the increase in the number of cancer whole-genome sequences has revealed numerous putative non-coding cancer drivers, their information is dispersed across multiple studies, and thus it is difficult to bridge the understanding of non-coding alterations, the genes they impact and the supporting evidence for their role in tumorigenesis across multiple cancer types. To address this gap, we have developed CNCDatabase, the Cornell Non-Coding Cancer driver Database (https://cncdatabase.med.cornell.edu/), which contains detailed information about predicted non-coding drivers at gene promoters, 5’ and 3’ UTRs (untranslated regions), enhancers, CTCF insulators and non-coding RNAs. CNCDatabase documents 1,111 protein-coding genes and 90 non-coding RNAs with reported drivers in their non-coding regions from 32 cancer types by computational predictions of positive selection in whole-genome sequences; differential gene expression in samples with and without mutations; or another set of experimental validations including luciferase reporter assays and genome editing. The database can be easily modified and scaled as lists of non-coding drivers are revised in the community with larger whole-genome sequencing studies, CRISPR screens and further experimental validations. Overall, CNCDatabase provides a helpful resource for researchers to explore the pathological role of non-coding alterations and their associations with gene expression in human cancers.
