The XL-mHG test for gene set enrichment

Author(s):  
Florian Wagner

The nonparametric minimum hypergeometric (mHG) test is a popular alternative to Kolmogorov-Smirnov (KS)-type tests for determining gene set enrichment. However, these approaches have not been compared to each other in a quantitative manner. Here, I first perform a simulation study to show that the mHG test is significantly more powerful than the one-sided KS test for detecting gene set enrichment. I then illustrate a shortcoming of the mHG test, which has motivated a semiparametric generalization of the test, termed the XL-mHG test. I describe an improved quadratic-time algorithm for the efficient calculation of exact XL-mHG p-values, as well as a linear-time algorithm for calculating a tighter upper bound for the p-value. Finally, I demonstrate that the XL-mHG test outperforms the one-sided KS test when applied to a reference gene expression study, and discuss general principles for analyzing gene set enrichment using the XL-mHG test. An efficient open-source Python/Cython implementation of the XL-mHG test is provided in the xlmhg package, available from PyPI and GitHub (https://github.com/flo-compbio/xlmhg) under an OSI-approved license.
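The statistic behind the test can be sketched in a few lines of Python. This is a minimal illustration only, not the xlmhg package's implementation: the function names are hypothetical, and it assumes that `X` acts as a minimum number of hits required at a cutoff and `L` as the largest cutoff considered. It returns the minimum hypergeometric tail probability over cutoffs (the test statistic), not the exact p-value, which requires the dynamic-programming algorithm mentioned in the abstract.

```python
from math import comb


def hypergeom_tail(k, N, K, n):
    """P(at least k hits) when drawing n of N items, K of which are hits."""
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def mhg_statistic(v, X=1, L=None):
    """Return (statistic, cutoff) for a ranked 0/1 list v.

    X and L are assumed semantics: only cutoffs n <= L with at least
    X hits among the top n entries are considered.
    """
    N, K = len(v), sum(v)
    if L is None:
        L = N
    best, best_cutoff = 1.0, 0
    hits = 0
    for n in range(1, L + 1):
        hits += v[n - 1]
        if v[n - 1] == 1 and hits >= X:
            # The minimum can only occur right after a hit, because
            # adding a non-hit at the cutoff makes the tail probability larger.
            p = hypergeom_tail(hits, N, K, n)
            if p < best:
                best, best_cutoff = p, n
    return best, best_cutoff
```

With `X=1` and `L=len(v)` this reduces to the plain mHG statistic; larger `X` or smaller `L` restricts which cutoffs can contribute.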

2017 ◽  


2020 ◽  
Author(s):  
John Spouge ◽  
Joseph M. Ziegelbauer ◽  
Mileidy Gonzalez

Abstract Background: Data about herpesvirus microRNA motifs on human circular RNAs suggested the following statistical question. Consider independent random counts, not necessarily identically distributed. Conditioned on the sum, decide whether one of the counts is unusually large. Exact computation of the p-value leads to a specific algorithmic problem. Given $n$ elements $g_0, g_1, \ldots, g_{n-1}$ in a set with the closure and associative properties and a commutative product without inverses, compute the jackknife (leave-one-out) products $\bar{g}_j = g_0 g_1 \cdots g_{j-1} g_{j+1} \cdots g_{n-1}$ $(0 \le j < n)$. Results: This article gives a linear-time Jackknife Product algorithm. Its upward phase constructs a standard segment tree for computing segment products like $g_{[i,j)} = g_i g_{i+1} \cdots g_{j-1}$; its novel downward phase mirrors the upward phase while exploiting the symmetry of […] and its complement $\bar{g}_j$. The algorithm requires storage for […] elements of […] and only about […] products. In contrast, the standard segment tree algorithms require about $n$ products for construction and $\log_2 n$ products for calculating each $\bar{g}_j$, i.e., about $n \log_2 n$ products in total; and a naïve quadratic algorithm using $n-2$ element-by-element products to compute each $\bar{g}_j$ requires $n(n-2)$ products. Conclusions: In the herpesvirus application, the Jackknife Product algorithm required 15 minutes; standard segment tree algorithms would have taken an estimated 3 hours; and the quadratic algorithm, an estimated 1 month. The Jackknife Product algorithm has many possible uses in bioinformatics and statistics.


Blood ◽  
2011 ◽  
Vol 118 (21) ◽  
pp. 3448-3448
Author(s):  
Harumi Kato ◽  
Kazuhito Yamamoto ◽  
Kennosuke Karube ◽  
Miyuki Katayama ◽  
Shinobu Tsuzuki ◽  
...  

Abstract 3448 Age-related EBV-associated B-cell lymphoproliferative disorder (AR-EBLPD) is classified as a subtype of diffuse large B-cell lymphoma (DLBCL) according to the WHO classification. However, the molecular genetic characteristics of AR-EBLPD remain largely unknown. We studied expression profiles of 5 AR-EBLPD and 8 EB-negative DLBCL samples using the Agilent 44K human oligonucleotide microarray. Total RNA was extracted from fresh-frozen tumor samples. Each microarray slide was converted into datasets using the Agilent Microarray Scanner and Feature Extraction software. Data were standardized with Z-scores. Differences in mRNA expression levels between the two sample groups were assessed using a two-sided t-test. A total of 1973 probes showed a p-value less than 0.05 with a false discovery rate (FDR) below 25%. These probes represented 1688 genes. The number of probes showing high expression in AR-EBLPD and EB-negative DLBCL was 804 (693 genes) and 1169 (995 genes), respectively. First, we selected the top 300 differentially expressed genes. Genes highly expressed in AR-EBLPD included IL6, TNFAIP3, HOPX, and SLAMF1. IL6 encodes a cytokine that functions in inflammation and the maturation of B lymphocytes, and TNFAIP3 is a negative regulator of the NF-kB pathway. HOPX and SLAMF1 have been reported as genes related to lymphocyte function or the immune system (Schwartzberg et al. Nature Immunology 2009, Hawiger et al. Nature Immunology 2011). For better characterization, we next performed Gene Ontology analysis using the WEB-based GEne SeT AnaLysis Toolkit and found that the categories of external stimulus and inflammatory responses were enriched in AR-EBLPD. Kyoto Encyclopedia of Genes and Genomes (KEGG) signaling analyses showed that the NOD-like receptor (p-value = 1.30e-06), JAK-STAT (p-value = 9.01e-06), and Toll-like receptor (p-value = 0.0002) pathways were characteristic of AR-EBLPD. 
These results implied that inflammation is prominent in AR-EBLPD cases. For validation, we next performed Gene Set Enrichment Analysis (GSEA) using the full database of KEGG pathways (186 gene sets). Dominant gene sets in AR-EBLPD included cytokine-cytokine receptor interaction [Normalized Enrichment Score (NES) = 2.66, p-value < 0.001], the NOD-like receptor pathway (NES = 2.26, p-value < 0.001), the Toll-like receptor pathway (NES = 2.14, p-value < 0.001), and the JAK-STAT pathway (NES = 1.79, p-value < 0.001). Because all of these pathways are related to the NF-kB pathway, the results suggested that inflammatory responses activate the NF-kB pathway, or vice versa. For confirmation, we finally performed GSEA using gene sets of the NF-kB pathway, which were obtained from a gene set reported by an NIH group (Puente et al. Nature 2011) and 30 gene sets in the GSEA database, and found that the gene sets of the NF-kB pathway were enriched in AR-EBLPD (Figure 1). Our results suggested that inflammatory and immune-related genes were enriched in AR-EBLPD and that activation of these genes may be associated with NF-kB activation. Aberrant immune and inflammatory responses could define the clinical presentations of AR-EBLPD cases. Figure 1. Gene Set Enrichment Analysis of 5 AR-EBLPD and 8 EB-negative DLBCL samples. The NF-kB signature reported by an NIH group (Puente et al. Nature 2011) was enriched in AR-EBLPD [Normalized Enrichment Score (NES) = 2.20, p-value < 0.001]. Disclosures: No relevant conflicts of interest to declare.


2009 ◽  
Vol 01 (03) ◽  
pp. 319-333 ◽  
Author(s):  
HUAMING ZHANG ◽  
MILIND VAIDYA

Irreducible triangulations are plane graphs with a quadrangular exterior face, triangular interior faces and no separating triangles. Fusy proposed a straight-line grid drawing algorithm for irreducible triangulations whose grid size is, asymptotically and with high probability, 11n/27 × 11n/27 up to an additive error of [Formula: see text]. Later, Fusy generalized the idea to quadrangulations and obtained a straight-line grid drawing whose grid size is, asymptotically and with high probability, 13n/27 × 13n/27 up to an additive error of [Formula: see text]. In this paper, we first prove that these two straight-line grid drawing algorithms for irreducible triangulations and quadrangulations in fact produce open rectangle-of-influence drawings of the respective graphs. Therefore, the grid size bounds above also hold for open rectangle-of-influence drawings. These results improve the previously known drawing sizes. In the second part of the paper, we present another application of Fusy's results: a linear-time algorithm for constructing a rectangular dual of a randomly generated irreducible triangulation with n vertices, one of whose dimensions equals [Formula: see text] asymptotically with high probability, up to an additive error of [Formula: see text]. In addition, we prove that the tight bound on one dimension of a rectangular dual of any irreducible triangulation with n vertices is (n + 1)/2.


2007 ◽  
Vol 18 (05) ◽  
pp. 911-930 ◽  
Author(s):  
RYUHEI UEHARA ◽  
YUSHI UNO

The longest path problem asks for a longest path in a given graph. While the graph classes on which the Hamiltonian path problem can be solved efficiently have been widely investigated, few graph classes are known on which the longest path problem can be solved efficiently. Among those, for trees, a simple linear-time algorithm for the longest path problem is known. We first generalize this algorithm and show that the longest path problem can be solved efficiently for some tree-like graph classes by this approach. We next propose two new graph classes that have natural interval representations, and show that the longest path problem can be solved efficiently on these classes.
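For the tree case mentioned above, one well-known linear-time approach is a double breadth-first search: search from any vertex to find a farthest vertex, then search again from that vertex. The sketch below illustrates this standard technique and is not necessarily the exact formulation the authors generalize; the function names and the adjacency-dictionary input format are assumptions made for the example.

```python
from collections import deque


def farthest_vertex(adj, start):
    """BFS from start; return a farthest vertex and the BFS parent map."""
    parent = {start: None}
    queue = deque([start])
    last = start
    while queue:
        last = queue.popleft()  # BFS dequeues in nondecreasing distance
        for w in adj[last]:
            if w not in parent:
                parent[w] = last
                queue.append(w)
    return last, parent


def longest_path_in_tree(adj):
    """Return the vertices of a longest path in a tree given as {v: [neighbors]}."""
    any_vertex = next(iter(adj))
    a, _ = farthest_vertex(adj, any_vertex)   # a is one end of a longest path
    b, parent = farthest_vertex(adj, a)       # b is the other end
    path = []
    while b is not None:
        path.append(b)
        b = parent[b]
    return path[::-1]
```

Each search visits every vertex and edge once, so the total running time is linear in the size of the tree.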


2020 ◽  
Author(s):  
Bruno P. Masquio ◽  
Paulo E. D. Pinto ◽  
Jayme L. Szwarcfiter

Graph matching problems, in which we seek sets of pairwise non-adjacent edges, are well known and well studied. Recently, there has been interest in matchings whose vertices induce a connected or a disconnected subgraph. Although both problems are related to connectivity, they probably differ considerably in complexity. While the complexity of finding a maximum disconnected matching is still unknown for general graphs, a maximum connected matching can be found in polynomial time. Our contribution in this paper is a linear-time algorithm that finds a maximum connected matching of a general connected graph, given a general maximum matching as input.
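To make the definition concrete, the following Python sketch checks whether a given edge set is a connected matching, that is, a matching whose endpoints induce a connected subgraph. It only verifies the property and does not implement the authors' linear-time construction; the function name and the input format (a dictionary mapping each vertex to a set of neighbors) are assumptions made for illustration.

```python
from collections import deque


def is_connected_matching(adj, matching):
    """Return True if `matching` is a matching of the graph `adj`
    and its endpoints induce a connected subgraph."""
    matched = set()
    for u, v in matching:
        # Each pair must be an edge, and no vertex may be used twice.
        if v not in adj[u] or u in matched or v in matched:
            return False
        matched.update((u, v))
    if not matched:
        return True  # the empty matching is treated as trivially connected
    start = next(iter(matched))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            # Walk only through matched vertices (the induced subgraph).
            if w in matched and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen == matched
```

On the path 1-2-3-4-5, for instance, the matching {(1, 2), (3, 4)} is connected, whereas {(1, 2), (4, 5)} is not, because vertex 3 is unmatched and separates the two edges.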


2020 ◽  
Author(s):  
John Spouge ◽  
Joseph M. Ziegelbauer ◽  
Mileidy Gonzalez

Abstract Background: Data about herpesvirus microRNA motifs on human circular RNAs suggested the following statistical question. Consider independent random counts, not necessarily identically distributed. Conditioned on the sum, decide whether one of the counts is unusually large. Exact computation of the p-value leads to a specific algorithmic problem. Given $n$ elements $g_0, g_1, \ldots, g_{n-1}$ in a set with the closure and associative properties and a commutative product without inverses, compute the jackknife (leave-one-out) products $\bar{g}_j = g_0 g_1 \cdots g_{j-1} g_{j+1} \cdots g_{n-1}$ $(0 \le j < n)$. Results: This article gives a linear-time Jackknife Product algorithm. Its upward phase constructs a standard segment tree for computing segment products like $g_{[i,j)} = g_i g_{i+1} \cdots g_{j-1}$; its novel downward phase mirrors the upward phase while exploiting the symmetry of […] and its complement $\bar{g}_j$. The algorithm requires storage for […] elements of […] and only about […] products. In contrast, the standard segment tree algorithms require about $n$ products for construction and $\log_2 n$ products for calculating each $\bar{g}_j$, i.e., about $n \log_2 n$ products in total; and a naïve quadratic algorithm using $n-2$ element-by-element products to compute each $\bar{g}_j$ requires $n(n-2)$ products. Conclusions: In the herpesvirus application, the Jackknife Product algorithm required 15 minutes; standard segment tree algorithms would have taken an estimated 3 hours; and the quadratic algorithm, an estimated 1 month. The Jackknife Product algorithm has many possible uses in bioinformatics and statistics.
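The two-phase idea can be illustrated with a short Python sketch. It is a reconstruction under assumptions, not the authors' published code: the function name, the array-based tree layout, and the use of `None`-free complement propagation are choices made for this example. The upward pass stores the product of each subtree; the downward pass gives every node the product of all leaves outside its subtree by combining its parent's complement with its sibling's subtree product. No inverses are used, and commutativity makes the order of factors irrelevant.

```python
def jackknife_products(items, op):
    """Leave-one-out products of `items` under a commutative,
    associative binary operation `op`, using no inverses."""
    n = len(items)
    if n < 2:
        raise ValueError("need at least two elements")
    # Array-based binary tree: leaves at n..2n-1, internal nodes at 1..n-1,
    # children of node i at 2i and 2i+1.
    prod = [None] * (2 * n)
    prod[n:] = items
    for i in range(n - 1, 0, -1):          # upward phase: n - 1 products
        prod[i] = op(prod[2 * i], prod[2 * i + 1])
    # comp[i] = product of all leaves outside the subtree rooted at i.
    comp = [None] * (2 * n)
    comp[2], comp[3] = prod[3], prod[2]    # children of the root
    for i in range(2, n):                  # downward phase: 2(n - 2) products
        comp[2 * i] = op(comp[i], prod[2 * i + 1])
        comp[2 * i + 1] = op(comp[i], prod[2 * i])
    return comp[n:]
```

For example, `jackknife_products([2, 3, 5, 7], lambda a, b: a * b)` yields the four products that each omit one factor, and `jackknife_products([4, 1, 3], max)` yields leave-one-out maxima, an operation for which no inverse exists.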



2020 ◽  
Vol 133 (5) ◽  
pp. 1060-1076
Author(s):  
Congli Zeng ◽  
Gabriel C. Motta-Ribeiro ◽  
Takuga Hinoshita ◽  
Marcos Adriano Lessa ◽  
Tilo Winkler ◽  
...  

Background: Pulmonary atelectasis is frequent in clinical settings. Yet there is limited mechanistic understanding and substantial clinical and biologic controversy on its consequences. The authors hypothesize that atelectasis produces local transcriptomic changes related to immunity and alveolar–capillary barrier function conducive to lung injury and further exacerbated by systemic inflammation. Methods: Female sheep underwent unilateral lung atelectasis using a left bronchial blocker and thoracotomy while the right lung was ventilated, with (n = 6) or without (n = 6) systemic lipopolysaccharide infusion. Computed tomography-guided samples were harvested for NextGen RNA sequencing from atelectatic and aerated lung regions. The Wald test was used to detect differential gene expression as an absolute fold change greater than 1.5 and adjusted P value (Benjamini–Hochberg) less than 0.05. Functional analysis was performed by gene set enrichment analysis. Results: Lipopolysaccharide-unexposed atelectatic versus aerated regions presented 2,363 differentially expressed genes. Lipopolysaccharide exposure induced 3,767 differentially expressed genes in atelectatic lungs but only 1,197 genes in aerated lungs relative to the corresponding lipopolysaccharide-unexposed tissues. Gene set enrichment for immune response in atelectasis versus aerated tissues yielded negative normalized enrichment scores without lipopolysaccharide (less than –1.23, adjusted P value less than 0.05) but positive scores with lipopolysaccharide (greater than 1.33, adjusted P value less than 0.05). Leukocyte-related processes (e.g., leukocyte migration, activation, and mediated immunity) were enhanced in lipopolysaccharide-exposed atelectasis partly through interferon-stimulated genes. 
Furthermore, atelectasis was associated with negatively enriched gene sets involving alveolar–capillary barrier function irrespective of lipopolysaccharide (normalized enrichment scores less than –1.35, adjusted P value less than 0.05). Yes-associated protein signaling was dysregulated with lower nuclear distribution in atelectatic versus aerated lung (lipopolysaccharide-unexposed: 10.0 ± 4.2 versus 13.4 ± 4.2 arbitrary units, lipopolysaccharide-exposed: 8.1 ± 2.0 versus 11.3 ± 2.4 arbitrary units, effect of lung aeration, P = 0.003). Conclusions: Atelectasis dysregulates the local pulmonary transcriptome with negatively enriched immune response and alveolar–capillary barrier function. Systemic lipopolysaccharide converts the transcriptomic immune response into positive enrichment but does not affect local barrier function transcriptomics. Interferon-stimulated genes and Yes-associated protein might be novel candidate targets for atelectasis-associated injury.


Blood ◽  
2019 ◽  
Vol 134 (Supplement_1) ◽  
pp. 4212-4212
Author(s):  
Malathi Kandarpa ◽  
Kristen Pettit ◽  
Tingting Qin ◽  
Yi-Mi Wu ◽  
Dan Robinson ◽  
...  

INTRODUCTION: The molecular basis of Philadelphia chromosome-negative myeloproliferative neoplasms (MPNs) is unclear. So-called "driver" mutations in JAK2, CALR, or MPL are present in the vast majority of cases, but there is no compelling evidence to explain how each mutant gene can lead to distinct and/or overlapping disease phenotypes. In an attempt to understand the molecular events that underlie the clinical characteristics of MPNs, we studied gene expression profiles and sequenced hematopoietic cells of MPN patients with a focus on the myelofibrosis (MF) landscape. METHODS: Patients consented to the MI-ONCOSEQ study approved by the University of Michigan IRB. Peripheral blood or bone marrow aspirates were either enriched for CD34-expressing cells or analyzed as peripheral blood or bone marrow mononuclear cells (PBMC/BMMC). MI-ONCOSEQ is a Next Generation Sequencing platform that identifies genetic aberrations in 1,711 genes and performs gene expression and fusion analysis of 24,774 capture targets by transcriptome sequencing. Gene expression data were compared between sub-groups of MF defined by clinical (spleen size) or molecular (mutations in Ras pathway genes) characteristics. Gene set enrichment analysis was conducted for the MF cohorts (23 CD34 enriched, 76 PBMC/BMMC) and the ET/PV/PrePMF cohort (12 CD34, 35 PBMC/BMMC). RESULTS: We analyzed the genetic landscape of 163 patients with MPNs: 113 with overt MF and 50 with ET (18), PV (23), or prePMF (9). In addition to driver genes, 183 other gene variants were observed. The number of gene variants was higher in older patients with MF (median 4 and 5 variants in those aged 70-80 and >80 yrs, respectively) than in ET/PV/PrePMF, where the median number of variants did not exceed 3. Mutations in ASXL1, TET2, RAS and SRSF2 increased in frequency with age (Fig 1). Gene expression profiles of sub-groups of MF were further analyzed to understand aberrantly regulated molecular pathways. 
Hierarchical clustering of all MF patients showed CD34-enriched samples to be distinct from the PBMC/BMMC cohort, and these cohorts were therefore analyzed separately. Moreover, hierarchical clustering suggested differences in patients with large spleens. In the comparison of MF to ET/PV/PrePMF within the CD34 population, gene enrichment highlighted hemopoiesis and leukemia-related pathways as well as endoplasmic reticulum and Golgi transport pathways. Previously, we saw that RAS pathway mutations predicted proliferative disease with high WBC counts. Therefore, we focused on RAS pathway mutated cohorts versus RAS wild-type cohorts and identified dysregulated pathways by gene set enrichment analysis (Fig 2). The RAS mutated MF cohort (PBMC/BMMC fraction) showed up-regulation of the cytokines IL6 (p-value 9.39E-06), IL8 (p-value 1.16E-04), and IL1beta (p-value 5.24E-04) and down-regulation of the TNF superfamily (p-value 9.98E-04). Most notably, there was up-regulation of NFkB transport to the nucleus (p-value 8.69E-05) and of transcription factor activity. In general, several metabolic pathways were affected and inflammatory pathways were up-regulated. Since spleen size is an indicator of disease severity, progression and response to therapy, gene set enrichment between cohorts of patients with different spleen sizes (>6 cm by physical exam versus <6 cm) was analyzed (Fig 3). The data suggest dysregulation of megakaryocyte differentiation (p-value 2.05E-05), cytokine production (p-value 9.09E-05) and signaling, the JAK-STAT pathway (p-value 5.10E-05), RAS signaling (p-value 7.67E-05) and the NFkB pathway (p-value 1.31E-04) in patients with larger spleens. CONCLUSIONS: Age is a major risk factor for many hematological malignancies. We analyzed the number of genetic variants by age and found that the accumulation of a higher number of mutations might explain why disease progresses rapidly in older patients. 
Gene expression changes, and not only genetic variants, also contribute to the pathogenesis of MF. Analysis of gene expression changes shows enrichment of genes regulating JAK-STAT pathway activity and cytokine production, as anticipated, but also implicates epigenetic regulators and the RAS signaling pathway in disease biology. Moreover, enriched pathways in the gene expression analysis underscore the dysregulation of NFkB, perhaps as a result of an inflammatory response. Thus, these pathways are promising candidates for intervention in patients with MF. Disclosures Pettit: Samus Therapeutics: Research Funding. Talpaz: Imago BioSciences: Consultancy, Research Funding; Celgene: Consultancy, Research Funding; CTI BioPharma: Research Funding; Constellation: Research Funding; Incyte: Research Funding; Novartis: Research Funding; Samus Therapeutics: Research Funding.

