A Novel Calibration Step in Gene Co-Expression Network Construction

2021 ◽  
Vol 1 ◽  
Author(s):  
Niloofar Aghaieabiane ◽  
Ioannis Koutis

High-throughput technologies such as DNA microarrays and RNA-sequencing are used to measure the expression levels of large numbers of genes simultaneously. To support the extraction of biological knowledge, individual gene expression levels are transformed into Gene Co-expression Networks (GCNs). In a GCN, nodes correspond to genes, and the weight of the connection between two nodes is a measure of similarity in the expression behavior of the two genes. In general, GCN construction and analysis includes three steps: 1) calculating a similarity value for each pair of genes; 2) using these similarity values to construct a fully connected weighted network; and 3) finding clusters of genes in the network, commonly called modules. The specific implementation of these three steps can significantly impact the final output and the downstream biological analysis. GCN construction is a well-studied topic. Existing algorithms rely on relatively simple statistical and mathematical tools to implement these steps. Currently, the WGCNA software package appears to be the most widely accepted standard. We hypothesize that the raw features provided by sequencing data can be leveraged to extract modules of higher quality. A novel preprocessing step of the gene expression data set is introduced that in effect calibrates the expression levels of individual genes before computing pairwise similarities. Further, the similarity is computed as an inner product of positive vectors. In experiments, this provides a significant improvement over WGCNA, as measured by aggregate p-values of the gene ontology term enrichment of the computed modules.
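The three generic steps listed in this abstract can be illustrated with a minimal sketch. It is not the authors' calibration method and not the WGCNA implementation: it assumes Pearson correlation as the similarity, an unsigned soft-thresholded weight |r|^beta (beta = 6 is an arbitrary choice here), and a deliberately simplified module step that takes connected components over strong edges. The gene names, expression values, and function names are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def build_network(expr, beta=6):
    """Steps 1 and 2: weight every gene pair by |r| ** beta (assumed form)."""
    return {
        (g, h): abs(pearson(expr[g], expr[h])) ** beta
        for g, h in combinations(sorted(expr), 2)
    }


def find_modules(genes, weights, cutoff=0.5):
    """Step 3, simplified: connected components over edges >= cutoff."""
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for (g, h), w in weights.items():
        if w >= cutoff:
            parent[find(g)] = find(h)

    modules = {}
    for g in genes:
        modules.setdefault(find(g), []).append(g)
    return sorted(sorted(m) for m in modules.values())


# Hypothetical expression matrix: gene -> expression level per sample.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
    "geneD": [4.8, 1.2, 4.1, 1.9, 3.2],
}
weights = build_network(expr)
modules = find_modules(list(expr), weights)
print(modules)
```

Real pipelines typically replace the component step with hierarchical clustering or another community-detection method; the sketch only shows how the choice made at each step feeds the next.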

BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Kyu-Sang Lim ◽  
Qian Dong ◽  
Pamela Moll ◽  
Jana Vitkovska ◽  
Gregor Wiktorin ◽  
...  

Abstract Background Gene expression profiling in blood is a potential source of biomarkers to evaluate or predict phenotypic differences between pigs but is expensive and inefficient because of the high abundance of globin mRNA in porcine blood. These limitations can be overcome by the use of QuantSeq 3’mRNA sequencing (QuantSeq) combined with a method to deplete or block the processing of globin mRNA prior to or during library construction. Here, we validated the effectiveness of QuantSeq using a novel specific globin blocker (GB) that is included in the library preparation step of QuantSeq. Results In data set 1, four concentrations of the GB were applied to RNA samples from two pigs. The GB significantly reduced the proportion of globin reads compared to non-GB (NGB) samples (P = 0.005) and increased the number of detectable non-globin genes. The highest evaluated concentration (C1) of the GB resulted in the largest reduction of globin reads compared to the NGB (from 56.4 to 10.1%). The second highest concentration C2, which showed very similar globin depletion rates (12%) as C1 but a better correlation of the expression of non-globin genes between NGB and GB (r = 0.98), allowed the expression of an additional 1295 non-globin genes to be detected, although 40 genes that were detected in the NGB sample (at a low level) were not present in the GB library. Concentration C2 was applied in the rest of the study. In data set 2, the distribution of the percentage of globin reads for NGB (n = 184) and GB (n = 189) samples clearly showed the effects of the GB on reducing globin reads, in particular for HBB, similar to results from data set 1. Data set 3 (n = 84) revealed that the proportion of globin reads that remained in GB samples was significantly and positively correlated with the reticulocyte count in the original blood sample (P < 0.001). 
Conclusions The effect of the GB on reducing the proportion of globin reads in porcine blood QuantSeq was demonstrated in three data sets. In addition to increasing the efficiency of sequencing non-globin mRNA, the GB for QuantSeq has an advantage that it does not require an additional step prior to or during library creation. Therefore, the GB is a useful tool in the quantification of whole gene expression profiles in porcine blood.


2019 ◽  
Vol 13 ◽  
pp. 117793221988143 ◽  
Author(s):  
Kar-Fu Yeung ◽  
Yi Yang ◽  
Can Yang ◽  
Jin Liu

Genome-wide association study (GWAS) analyses have identified thousands of associations between genetic variants and complex traits. However, it is still a challenge to uncover the mechanisms underlying the association. With the growing availability of transcriptome data sets, it has become possible to perform statistical analyses targeted at identifying influential genes whose expression levels correlate with the phenotype. Methods such as PrediXcan and transcriptome-wide association study (TWAS) use the transcriptome data set to fit a predictive model for gene expression, with genetic variants as covariates. The gene expression levels for the GWAS data set are then ‘imputed’ using the prediction model, and the imputed expression levels are tested for their association with the phenotype. These methods fail to account for the uncertainty in the GWAS imputation step, and we propose a collaborative mixed model (CoMM) that addresses this limitation by jointly modelling the multiple analysis steps. We illustrate CoMM’s ability to identify relevant genes in the Northern Finland Birth Cohort 1966 data set and extend the model to handle the more widely available GWAS summary statistics.
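The two-stage procedure that this abstract attributes to PrediXcan and TWAS can be sketched as follows. This is a toy illustration, not either tool and not CoMM: it assumes a single genetic variant, ordinary least squares in both stages, and invented data. It shows where the imputed expression is treated as if it were observed, which is the uncertainty the abstract says is ignored.

```python
def ols(x, y):
    """Return (intercept, slope) of a simple least-squares fit y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Stage 1: fit expression ~ genotype in a (hypothetical) transcriptome panel.
ref_genotype = [0, 0, 1, 1, 2, 2]
ref_expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
intercept, weight = ols(ref_genotype, ref_expression)

# Stage 2: impute expression in a (hypothetical) GWAS cohort, then regress
# the phenotype on the imputed values. The imputed values are plugged in as
# fixed numbers, so the estimation error in `weight` is not propagated.
gwas_genotype = [0, 1, 2, 1, 0, 2]
gwas_phenotype = [0.5, 1.1, 1.4, 0.9, 0.6, 1.6]
imputed = [intercept + weight * g for g in gwas_genotype]
_, effect = ols(imputed, gwas_phenotype)

print(f"stage-1 weight = {weight:.3f}, stage-2 effect = {effect:.3f}")
```

A joint model, as proposed in the abstract, would estimate both stages together rather than passing a point estimate from one to the other.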


2006 ◽  
Vol 5 (10) ◽  
pp. 1713-1725 ◽  
Author(s):  
Philippe Leprohon ◽  
Danielle Légaré ◽  
Isabelle Girard ◽  
Barbara Papadopoulou ◽  
Marc Ouellette

ABSTRACT The ATP-binding cassette (ABC) protein superfamily is one of the largest evolutionarily conserved families and is found in all kingdoms of life. The recent completion of the Leishmania genome sequence allowed us to analyze and classify its encoded ABC proteins. The complete sequence predicts a data set of 42 open reading frames (ORFs) coding for proteins belonging to the ABC superfamily, with representative members of every major subfamily (from ABCA to ABCH) commonly found in eukaryotes. Comparative analysis showed that the same ABC data set is found between Leishmania major and Leishmania infantum and that some orthologues are found in the genome of the related parasites Trypanosoma brucei and Trypanosoma cruzi. Customized DNA microarrays were made to assess ABC gene expression profiling throughout the two main Leishmania life stages. Two ABC genes (ABCA3 and ABCG3) are preferentially expressed in the amastigote stage, whereas one ABC gene (ABCF3) is more abundantly expressed in promastigotes. Microarray-based expression profiling experiments also revealed that three ABC genes (ABCA3, ABCC3, and ABCH1) are overexpressed in two independent antimony-resistant strains compared to the parental sensitive strain. All microarray results were confirmed by real-time reverse transcription-PCR assays. The present study provides a thorough phylogenic classification of the Leishmania ABC proteins and sets the basis for further functional studies on this important class of proteins.


Blood ◽  
2011 ◽  
Vol 118 (21) ◽  
pp. 629-629
Author(s):  
Yiming Zhou ◽  
Qing Zhang ◽  
Christoph Heuck ◽  
Owen Stephens ◽  
Erming Tian ◽  
...  

Abstract 629 Background: Cytogenetic abnormalities (CA) are a hallmark of multiple myeloma (MM) and other cancers and are commonly used as clinical parameters for determining disease stage and guiding therapy decisions. Traditional techniques, including fluorescence in situ hybridization (FISH) and karyotyping, and the recently developed array-based comparative genomic hybridization are expensive and time-consuming. As gene expression profiling (GEP) is becoming more integrated in the diagnostic workup of MM and is increasingly being used for risk stratification as well as tailoring therapy, we are presented with vast amounts of data that should reflect disease-associated alterations of the genome. We therefore sought to develop a GEP-based virtual CA (vCA) model to predict CA in MM. Methods/Results: We determined genome-wide gene expression profiles and DNA copy numbers (CNs) in purified plasma cell samples obtained from 92 newly diagnosed MM patients, using the Affymetrix GeneChip and the Agilent aCGH platforms, respectively. We identified 1,114 CN-sensitive genes by Pearson's correlation coefficient (PCC) of gene expression levels and the copy numbers of the corresponding DNA loci, keeping the false discovery rate to <5%. On the basis of these CN-sensitive genes, we developed a vCA model for predicting CA in MM patients by means of GEP. The model focuses particularly on chromosomes 3, 5, 7, 9, 11, 13, 15, 19, and 21, as well as the 1p, 1q, and 6q segments, which are the most commonly altered chromosome regions in MM plasma cells. The reference CA (rCA) of a given chromosome region were determined by the mean values of signals of aCGH probes located in that region. The values of rCA could be used to distinguish among amplification, deletion, and normal. The predicted CA (pCA) of a given chromosome region were determined by the following procedures. First, we calculated the mean expression levels of CN-sensitive genes within the region. 
Then, by training the model in a GEP data set with 92 MM samples, we set the cutoff value of the mean expression levels of CN-sensitive genes for each chromosome region in order to obtain pCA that were most consistent with rCA in terms of the Matthews correlation coefficient, a measure of the quality of binary (two-class) classifications. The mean prediction accuracy was 0.88 (0.59–0.99) when the model was applied to the training data set. To check for overfitting in the vCA model, we applied the model to an independent data set of 23 MM samples for which both GEP and aCGH data were available. The mean prediction accuracy was 0.89 (0.74–1.00), which indicated that overfitting was negligible if present at all. We further validated the model with a FISH data set compiled from 262 independent MM samples for which both FISH records and GEP data were available. The mean prediction accuracy was 0.87. The consistency between vCA-predicted chromosomal alterations and findings of karyotyping dropped to 0.65. However, this underperformance could be due to the fact that karyotyping is limited by the low proliferation rate of terminally differentiated plasma cells in vitro. Conclusion: Our results provide a proof of concept that GEP data alone can reveal all the information provided by conventional cytogenetic techniques. We show that re-purposing gene expression data using our model is a fast and economical way to obtain cytogenetic information that is accurate and can be used for diagnosis and observation in MM and potentially other malignancies. GEP can serve as a one-stop genomic data source for information from the level of specific genes to whole chromosomes. 
Disclosures: Barlogie: Celgene: Consultancy, Honoraria, Research Funding; IMF: Consultancy, Honoraria; MMRF: Consultancy; Millennium: Consultancy, Honoraria, Research Funding; Genzyme: Consultancy; Novartis: Research Funding; NCI: Research Funding; Johnson & Johnson: Research Funding; Centocor: Research Funding; Onyx: Research Funding; Icon: Research Funding. Shaughnessy: Myeloma Health, Celgene, Genzyme, Novartis: Consultancy, Employment, Equity Ownership, Honoraria, Patents & Royalties.
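The cutoff-selection step in the abstract above (choosing, for each chromosome region, the mean-expression cutoff whose calls agree best with the reference calls under the Matthews correlation coefficient) can be sketched as below. The scores, reference labels, and function names are hypothetical, and the sketch assumes a single binary call (altered versus not altered) with candidate cutoffs taken from the observed scores; the abstract does not specify these details.

```python
from math import sqrt


def mcc(truth, pred):
    """Matthews correlation coefficient for two boolean sequences."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    tn = sum(1 for t, p in zip(truth, pred) if not t and not p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally defined as 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_cutoff(scores, truth):
    """Pick the observed score that maximizes MCC when used as a cutoff."""
    return max(
        sorted(set(scores)),
        key=lambda c: mcc(truth, [s >= c for s in scores]),
    )


# Hypothetical training data for one region: mean expression of the
# CN-sensitive genes per sample, and the reference call from aCGH.
scores = [0.2, 0.4, 0.5, 0.9, 1.1, 1.3]
truth = [False, False, False, True, True, True]

cutoff = best_cutoff(scores, truth)
predicted = [s >= cutoff for s in scores]
print(cutoff, mcc(truth, predicted))
```

Applying the trained cutoff to an independent data set, as the abstract describes, is the check that the chosen value is not merely fitted to the training samples.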


Blood ◽  
2015 ◽  
Vol 126 (23) ◽  
pp. 2431-2431 ◽  
Author(s):  
Reem Karmali ◽  
Annesha Basu ◽  
Jeffrey A Borgia ◽  
Leo I. Gordon ◽  
Parameswaran Venugopal ◽  
...  

Abstract Background The "double hit" (DH) lymphomas that harbor a c-myc mutation and BCL2 translocation, or "double protein expressor" (DP) lymphomas that over-express c-myc and BCL2 proteins in the absence of a detectable mutation, have amongst the worst clinical outcomes as compared to patients with diffuse large B-cell lymphomas (DLBCL) that lack upregulation of the c-myc oncogene. Metformin can down-regulate translation of c-myc, making it an appropriate anti-cancer drug to explore in c-myc+ lymphomas. Furthermore, a method to identify DH/DP patients most likely to benefit from metformin treatment has clinical relevance. Methods Within a publicly available gene expression array data set of R-CHOP treated DLBCL (n=232; GSE10846), a subset of DH/DP patients were defined as having above median expression of myc and BCL2 and below median expression of BCL6 as previously published by Horn et al. Survival analysis, significance analysis of microarrays (SAM) and gene set analysis (GSA) were performed characterizing the clinical, individual gene and biological ontology differences between DH/DP and non-DH/DP populations. Expression array data from a study testing metformin effects on THP-1 monocyte cells was reanalyzed using SAM and GSA as well. Changes in individual gene expression and overlapping ontological themes common to both GSA analyses of metformin effects on THP-1 cells and DH/DP characterization were identified. Genes with differential expression (DE) in both groups were evaluated topologically using a protein-protein interaction database to determine if any gene products had previously observed direct interactions. Network community detection identified tightly coupled signaling modules linking co-expression to mechanism. 
The resulting metformin-DH/DP network metagene was evaluated by k-means, clustering tumor samples into two groups over the metagene members in an independent data set of R-CHOP treated DLBCL patients (n=249; GSE32918) with differences in overall survival (OS) determined by log-rank. Results Of the 232 DLBCL patients treated with R-CHOP, 26 fit the criteria for DH/DP and had significantly lower OS (HR = 2.96; p < 0.001). In DH/DP tumors, 2780 genes had DE (2208 up-regulated; 572 down-regulated), enriched for biological processes related to transcription, metabolism and cytokine production and down-regulated for processes related to immune response, cell signaling, vascular development and proliferation (Fig. 1A). Analysis of metformin treated THP-1 cells relative to control identified 7123 genes with DE. Biological themes common to metformin treatment and DH/DP specific biology were identified including mitochondrial biogenesis, alternate splicing, and hormone secretion (Fig. 1A-B; highlighted in red). The intersection of genes with DE in metformin treated and DH/DP data sets identified 102 genes with direct interaction within a protein interaction network. Of the 19 communities detected by analyzing the resulting network topology, 3 showed significant correlation to survival in the GSE10846 data set (Fig. 2A, in red), forming a metformin-DH/DP metagene (Met-DH/DP-MG; n = 29 genes total). This metagene was validated by applying it to an independent cohort of R-CHOP treated DLBCL patients (n = 249), demonstrating 2 cluster groups (cluster 1, n=178; cluster 2, n=71; Fig. 2B) with differences in OS (HR = 1.61; p < 0.001; Fig. 2C). Conclusion We have identified a metagene of interacting proteins associated with both metformin therapeutic effect and OS in DH/DP patients. 
This offers a potential method for selecting patients most likely to benefit from metformin therapy and identifies mechanistic avenues by which metformin treatment may specifically benefit DH/DP patients. As such, in vitro studies using DH cell lines and a phase I/II clinical trial exploring chemo-immunotherapy with metformin as an adjunct in DH/DP lymphomas are underway. Disclosures No relevant conflicts of interest to declare.


2015 ◽  
Author(s):  
Benjamin K Johnson ◽  
Matthew B Scholz ◽  
Tracy K Teal ◽  
Robert B Abramovitch

Summary: SPARTA is a reference-based bacterial RNA-seq analysis workflow application for single-end Illumina reads. SPARTA is turnkey software that simplifies the process of analyzing RNA-seq data sets, making bacterial RNA-seq analysis a routine process that can be undertaken on a personal computer or in the classroom. The easy-to-install, complete workflow processes whole transcriptome shotgun sequencing data files by trimming reads and removing adapters, mapping reads to a reference, counting gene features, calculating differential gene expression, and, importantly, checking for potential batch effects within the data set. SPARTA outputs quality analysis reports, gene feature counts and differential gene expression tables and scatterplots. The workflow is implemented in Python for file management and sequential execution of each analysis step and is available for Mac OS X, Microsoft Windows, and Linux. To promote the use of SPARTA as a teaching platform, a web-based tutorial is available explaining how RNA-seq data are processed and analyzed by the software. Availability and Implementation: Tutorial and workflow can be found at sparta.readthedocs.org. Teaching materials are located at sparta-teaching.readthedocs.org. Source code can be downloaded at www.github.com/abramovitchMSU/, implemented in Python and supported on Mac OS X, Linux, and MS Windows. Contact: Robert B. Abramovitch ([email protected]) Supplemental Information: Supplementary data are available online


2015 ◽  
Author(s):  
Andrew Anand Brown ◽  
Zhihao Ding ◽  
Ana Viñuela ◽  
Dan Glass ◽  
Leopold Parts ◽  
...  

Statistical factor analysis methods have previously been used to remove noise components from high dimensional data prior to genetic association mapping, and in a guided fashion to summarise biologically relevant sources of variation. Here we show how the derived factors summarising pathway expression can be used to analyse the relationships between expression, heritability and ageing. We used skin gene expression data from 647 twins from the MuTHER Consortium and applied factor analysis to concisely summarise patterns of gene expression, both to remove broad confounding influences and to produce concise pathway-level phenotypes. We derived 930 "pathway phenotypes" which summarised patterns of variation across 186 KEGG pathways (five phenotypes per pathway). We identified 69 significant associations of age with phenotype from 57 distinct KEGG pathways at a stringent Bonferroni threshold (P<5.38E-5). These phenotypes are more heritable (h^2=0.32) than gene expression levels. On average, expression levels of 16% of genes within these pathways are associated with age. Several significant pathways relate to metabolising sugars and fatty acids, others with insulin signalling. We have demonstrated that factor analysis methods combined with biological knowledge can produce more reliable phenotypes with less stochastic noise than the individual gene expression levels, which increases our power to discover biologically relevant associations. These phenotypes could also be applied to discover associations with other environmental factors.
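The Bonferroni threshold quoted in this abstract is consistent with dividing a family-wise error level by the number of pathway phenotypes tested. A short check, assuming a family-wise level of 0.05 (the abstract does not state it explicitly):

```python
n_pathways = 186             # KEGG pathways, from the abstract
phenotypes_per_pathway = 5   # from the abstract
alpha = 0.05                 # assumed family-wise error level

n_tests = n_pathways * phenotypes_per_pathway
threshold = alpha / n_tests

print(f"{n_tests} tests, threshold = {threshold:.2E}")
# prints "930 tests, threshold = 5.38E-05"
```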


Author(s):  
Guro Dørum ◽  
Lars Snipen ◽  
Margrete Solheim ◽  
Solve Saebo

Gene set analysis methods have become a widely used tool for including prior biological knowledge in the statistical analysis of gene expression data. Advantages of these methods include increased sensitivity, easier interpretation and more conformity in the results. However, gene set methods do not employ all the available information about gene relations. Genes are arranged in complex networks where the network distances contain detailed information about inter-gene dependencies. We propose a method that uses gene networks to smooth gene expression data with the aim of reducing the number of false positives and identify important subnetworks. Gene dependencies are extracted from the network topology and are used to smooth genewise test statistics. To find the optimal degree of smoothing, we propose using a criterion that considers the correlation between the network and the data. The network smoothing is shown to improve the ability to identify important genes in simulated data. Applied to a real data set, the smoothing accentuates parts of the network with a high density of differentially expressed genes.
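The idea of smoothing genewise test statistics over a network can be illustrated with a deliberately simple scheme. This is not the authors' method: it assumes that each gene's statistic is blended with the mean of its direct neighbours using a fixed weight alpha, whereas the abstract describes using network distances and a data-driven criterion for the degree of smoothing. The gene names, statistics, and edges are hypothetical.

```python
def smooth(stats, neighbors, alpha=0.5):
    """Blend each gene's statistic with the mean of its neighbours.

    alpha = 0 leaves the statistics unchanged; alpha = 1 replaces each
    statistic with its neighbourhood mean. Isolated genes are unchanged.
    """
    smoothed = {}
    for gene, value in stats.items():
        nbrs = neighbors.get(gene, [])
        if nbrs:
            nbr_mean = sum(stats[n] for n in nbrs) / len(nbrs)
            smoothed[gene] = (1 - alpha) * value + alpha * nbr_mean
        else:
            smoothed[gene] = value
    return smoothed


# Hypothetical genewise test statistics and an undirected network.
stats = {"g1": 4.0, "g2": 3.0, "g3": 5.0, "g4": 4.0, "g5": 0.0, "g6": 0.0}
neighbors = {
    "g1": ["g2", "g3"], "g2": ["g1", "g3"], "g3": ["g1", "g2"],
    "g4": ["g5", "g6"], "g5": ["g4"], "g6": ["g4"],
}

smoothed = smooth(stats, neighbors)
print(smoothed["g1"], smoothed["g4"])
```

In this toy network, g1 sits in a neighbourhood of similarly high statistics and keeps its value, while g4 has the same raw statistic but is surrounded by genes showing no signal and is pulled down, which is the kind of behaviour that could reduce isolated false positives.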


2014 ◽  
Author(s):  
Jenny Tung ◽  
Xiang Zhou ◽  
Susan C Alberts ◽  
Matthew Stephens ◽  
Yoav Gilad

Gene expression variation is well documented in human populations and its genetic architecture has been extensively explored. However, we still know little about the genetic architecture of gene expression variation in other species, particularly our closest living relatives, the nonhuman primates. To address this gap, we performed an RNA sequencing (RNA-seq)-based study of 63 wild baboons, members of the intensively studied Amboseli baboon population in Kenya. Our study design allowed us to measure gene expression levels and identify genetic variants using the same data set, enabling us to perform complementary mapping of putative cis-acting expression quantitative trait loci (eQTL) and measurements of allele-specific expression (ASE) levels. We discovered substantial evidence for genetic effects on gene expression levels in this population. Surprisingly, we found more power to detect individual eQTL in the baboons relative to a HapMap human data set of comparable size, probably as a result of greater genetic variation, enrichment of SNPs with high minor allele frequencies, and longer-range linkage disequilibrium in the baboons. eQTL were most likely to be identified for lineage-specific, rapidly evolving genes. Interestingly, genes with eQTL significantly overlapped between the baboon and human data sets, suggesting that some genes may tolerate more genetic perturbation than others, and that this property may be conserved across species. Finally, we used a Bayesian sparse linear mixed model to partition genetic, demographic, and early environmental contributions to variation in gene expression levels. We found a strong genetic contribution to gene expression levels for almost all genes, while individual demographic and environmental effects tended to be more modest. 
Together, our results establish the feasibility of eQTL mapping using RNA-seq data alone, and act as an important first step towards understanding the genetic architecture of gene expression variation in nonhuman primates.


eLife ◽  
2015 ◽  
Vol 4 ◽  
Author(s):  
Jenny Tung ◽  
Xiang Zhou ◽  
Susan C Alberts ◽  
Matthew Stephens ◽  
Yoav Gilad

Primate evolution has been argued to result, in part, from changes in how genes are regulated. However, we still know little about gene regulation in natural primate populations. We conducted an RNA sequencing (RNA-seq)-based study of baboons from an intensively studied wild population. We performed complementary expression quantitative trait locus (eQTL) mapping and allele-specific expression analyses, discovering substantial evidence for, and surprising power to detect, genetic effects on gene expression levels in the baboons. eQTL were most likely to be identified for lineage-specific, rapidly evolving genes; interestingly, genes with eQTL significantly overlapped between baboons and a comparable human eQTL data set. Our results suggest that genes vary in their tolerance of genetic perturbation, and that this property may be conserved across species. Further, they establish the feasibility of eQTL mapping using RNA-seq data alone, and represent an important step towards understanding the genetic architecture of gene expression in primates.

