A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC

2021 ◽  
Vol 17 (10) ◽  
pp. e1009459
Author(s):  
Jason Bennett ◽  
Mikhail Pomaznoy ◽  
Akul Singhania ◽  
Bjoern Peters

Recent technological advances have made the gathering of comprehensive gene expression datasets a commodity. This has shifted the limiting step of transcriptomic studies from the accumulation of data to their analysis and interpretation. The main problem in analyzing transcriptomics data is that the number of independent samples is typically much lower (<100) than the number of genes whose expression is quantified (typically >14,000). To address this, it would be desirable to reduce the gathered data’s dimensionality without losing information. Clustering genes into discrete modules is one of the most commonly used tools to accomplish this task. While there are multiple clustering approaches, there is a lack of informative metrics available to evaluate the resultant clusters’ biological quality. Here we present a metric that incorporates known ground truth gene sets to quantify the biological quality of gene clusters derived from standard clustering techniques. The GECO (Ground truth Evaluation of Clustering Outcomes) metric demonstrates that quantitative and repeatable scoring of gene clusters is not only possible but computationally lightweight and robust. Unlike current methods, it allows direct comparison between gene clusters generated by different clustering techniques. It also reveals that current cluster analysis techniques often underestimate the number of clusters that should be formed from a dataset, which leads to fewer clusters of lower quality. As a test case, we applied GECO combined with k-means clustering to derive an optimal set of co-expressed gene modules from PBMC, which we show to be superior to modules previously generated from whole blood. Overall, GECO provides a rational metric to test and compare different clustering approaches to analyze high-dimensional transcriptomic data.
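
The abstract does not give the GECO formula, so the following is only a rough sketch of the general idea of scoring a clustering against known ground truth gene sets. It uses a simple best-match Jaccard overlap as a stand-in. The function names, the scoring rule, and the toy gene symbols are all assumptions made for illustration and should not be read as the published metric.

```python
from statistics import mean

def jaccard(a, b):
    """Jaccard index between two gene collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def clustering_score(clusters, ground_truth_sets):
    """Mean, over ground-truth sets, of the best Jaccard match among clusters.

    Illustrative stand-in only; this is NOT the published GECO formula.
    """
    if not clusters or not ground_truth_sets:
        return 0.0
    return mean(max(jaccard(gs, c) for c in clusters) for gs in ground_truth_sets)

# Hypothetical toy data: gene symbols chosen only for illustration.
ground_truth = [{"CD3E", "CD3D", "CD2"}, {"CD19", "MS4A1"}]
clustering_a = [{"CD3E", "CD3D", "CD2"}, {"CD19", "MS4A1"}]  # recovers both sets
clustering_b = [{"CD3E", "CD19"}, {"CD3D", "CD2", "MS4A1"}]  # mixes the two sets

score_a = clustering_score(clustering_a, ground_truth)
score_b = clustering_score(clustering_b, ground_truth)
print(score_a, score_b)
```

Because the score depends only on the cluster memberships and the reference sets, the same function can be applied to the output of any clustering method, which is the kind of cross-method comparison the abstract describes.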

Author(s):  
Patrick Videau ◽  
Kaitlyn Wells ◽  
Arun Singh ◽  
Jessie Eiting ◽  
Philip Proteau ◽  
...  

Cyanobacteria are prolific producers of natural products and genome mining has shown that many orphan biosynthetic gene clusters can be found in sequenced cyanobacterial genomes. New tools and methodologies are required to investigate these biosynthetic gene clusters and here we present the use of Anabaena sp. strain PCC 7120 as a host for combinatorial biosynthesis of natural products using the indolactam natural products (lyngbyatoxin A, pendolmycin, and teleocidin B-4) as a test case. We were able to successfully produce all three compounds using codon optimized genes from Actinobacteria. We also introduce a new plasmid backbone based on the native Anabaena 7120 plasmid pCC7120ζ and show that production of teleocidin B-4 can be accomplished using a two-plasmid system, which can be introduced by co-conjugation.


2019 ◽  
Author(s):  
Othman Soufan ◽  
Jessica Ewald ◽  
Charles Viau ◽  
Doug Crump ◽  
Markus Hecker ◽  
...  

There is growing interest within regulatory agencies and toxicological research communities to develop, test, and apply new approaches, such as toxicogenomics, to more efficiently evaluate chemical hazards. Given the complexity of analyzing thousands of genes simultaneously, there is a need to identify reduced gene sets. Though several gene sets have been defined for toxicological applications, few of these were purposefully derived using toxicogenomics data. Here, we developed and applied a systematic approach to identify 1000 genes (called Toxicogenomics-1000 or T1000) highly responsive to chemical exposures. First, a co-expression network of 11,210 genes was built by leveraging microarray data from the Open TG-GATEs program. This network was then re-weighted based on prior knowledge of their biological (KEGG, MSigDB) and toxicological (CTD) relevance. Finally, weighted correlation network analysis was applied to identify 258 gene clusters. T1000 was defined by selecting genes from each cluster that were most associated with outcome measures. For model evaluation, we compared the performance of T1000 to that of other gene sets (L1000, S1500, genes selected by Limma, and a random set) using two external datasets. Additionally, a smaller (T384) and a larger version (T1500) of T1000 were used for dose-response modeling to test the effect of gene set size. Our findings demonstrated that the T1000 gene set is predictive of apical outcomes across a range of conditions (e.g., in vitro and in vivo, dose-response, multiple species, tissues, and chemicals), and generally performs as well as, or better than, other gene sets available.
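
The final selection step described above (taking, from each cluster, the genes most associated with outcome measures) can be sketched roughly as follows. The abstract does not say how many genes were taken per cluster or how association was scored, so `per_cluster`, the score dictionary, and the gene names here are hypothetical placeholders.

```python
from collections import defaultdict

def select_top_genes(cluster_of, association, per_cluster):
    """Pick the `per_cluster` highest-scoring genes from every cluster."""
    by_cluster = defaultdict(list)
    for gene, cluster in cluster_of.items():
        by_cluster[cluster].append(gene)

    selected = []
    for cluster in sorted(by_cluster):
        # Sort by descending score; break ties by gene name for determinism.
        ranked = sorted(by_cluster[cluster], key=lambda g: (-association[g], g))
        selected.extend(ranked[:per_cluster])
    return selected

# Hypothetical toy input: gene -> cluster id, gene -> association score.
cluster_of = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
association = {"g1": 0.2, "g2": 0.9, "g3": 0.5, "g4": 0.1, "g5": 0.7}

panel = select_top_genes(cluster_of, association, per_cluster=2)
print(panel)
```

Drawing a quota from every cluster, rather than taking the globally top-scoring genes, keeps each co-expression module represented in the reduced panel.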


2020 ◽  
Author(s):  
Bahar Azari ◽  
Christiana Westlin ◽  
Ajay Satpute ◽  
J. Benjamin Hutchinson ◽  
Philip A. Kragel ◽  
...  

Machine learning methods provide powerful tools to map physical measurements to scientific categories. But are such methods suitable for discovering the ground truth about psychological categories? We use the science of emotion as a test case to explore this question. In studies of emotion, researchers use supervised classifiers, guided by emotion labels, to attempt to discover biomarkers in the brain or body for the corresponding emotion categories. This practice relies on the assumption that the labels refer to objective categories that can be discovered. Here, we critically examine this approach across three distinct datasets collected during emotional episodes (measuring the human brain, body, and subjective experience) and compare results from supervised classification with those from unsupervised clustering, in which no a priori labels are assigned to the data. We conclude with a set of recommendations to guide researchers towards meaningful, data-driven discoveries in the science of emotion and beyond.


2019 ◽  
Vol 2 (1) ◽  
Author(s):  
Francesco Del Carratore ◽  
Konrad Zych ◽  
Matthew Cummings ◽  
Eriko Takano ◽  
Marnix H. Medema ◽  
...  

2015 ◽  
Vol 11 (4) ◽  
pp. 1012-1028
Author(s):  
Igor F. Tsigelny ◽  
Valentina L. Kouznetsova ◽  
Pengfei Jiang ◽  
Sandeep C. Pingle ◽  
Santosh Kesari

We report an integrative networks-based analysis to identify a system of coherent gene modules in primary and secondary glioblastoma.


2003 ◽  
Vol 185 (8) ◽  
pp. 2548-2554 ◽  
Author(s):  
Gwendolyn E. Wood ◽  
Andrew K. Haydock ◽  
John A. Leigh

ABSTRACT Methanococcus maripaludis is a mesophilic species of Archaea capable of producing methane from two substrates: hydrogen plus carbon dioxide and formate. To study the latter, we identified the formate dehydrogenase genes of M. maripaludis and found that the genome contains two gene clusters important for formate utilization. Phylogenetic analysis suggested that the two formate dehydrogenase gene sets arose from duplication events within the methanococcal lineage. The first gene cluster encodes homologs of formate dehydrogenase α (FdhA) and β (FdhB) subunits and a putative formate transporter (FdhC) as well as a carbonic anhydrase analog. The second gene cluster encodes only FdhA and FdhB homologs. Mutants lacking either fdhA gene exhibited a partial growth defect on formate, whereas a double mutant was completely unable to grow on formate as a sole methanogenic substrate. Investigation of fdh gene expression revealed that transcription of both gene clusters is controlled by the presence of H2 and not by the presence of formate.


2010 ◽  
Vol 95 (4) ◽  
pp. 1962-1971 ◽  
Author(s):  
Leonie van der Heul-Nieuwenhuijsen ◽  
Roos C. Padmos ◽  
Roosmarijn C. Drexhage ◽  
Harm de Wit ◽  
Arie Berghout ◽  
...  

Abstract Context: In monocytes of patients with autoimmune diabetes, we recently identified a gene expression fingerprint of two partly overlapping gene clusters, a PDE4B-associated cluster (consisting of 12 core proinflammatory cytokine/compound genes), a FABP5-associated cluster (three core genes), and a set of nine overlapping chemotaxis, adhesion, and cell assembly genes correlating to both PDE4B and FABP5. Objective: Our objective was to study whether a similar monocyte inflammatory fingerprint as found in autoimmune diabetes is present in autoimmune thyroid disease (AITD). Design and Patients: Quantitative PCR was used for analysis of 28 genes in monocytes of 67 AITD patients and 70 healthy controls. The tested 28 genes were the 24 genes previously found abnormally expressed in monocytes of autoimmune diabetes patients plus four extra genes found in whole-genome analysis of monocytes of AITD patients reported here. Results: Monocytes of 24% of AITD and 50% of latent autoimmune diabetes of adults (LADA) patients shared an inflammatory fingerprint consisting of the set of 24 genes of the PDE4B, FABP5, and overlapping gene sets. This study in addition revealed that FCAR, the gene for the Fcα receptor I, and PPBP, the gene for CXCL7, were part of this proinflammatory monocyte fingerprint. Conclusions: Our study provides an important tool to determine a shared, specific proinflammatory state of monocytes in AITD and LADA patients, enabling further research into the role of such proinflammatory cells in the failure to preserve tolerance in these conditions and of key fingerprint genes involved.


2017 ◽  
Author(s):  
Roberto Lozano ◽  
Dunia Pino del Carpio ◽  
Teddy Amuge ◽  
Ismail Siraj Kayondo ◽  
Alfred Ozimati Adebo ◽  
...  

Abstract Background: Genomic prediction models were, in principle, developed to include all the available marker information; with this approach, these models have shown moderate to high predictive accuracies in various crops. Previous studies in cassava have demonstrated that, even with relatively small training populations and low-density GBS markers, prediction models are feasible for genomic selection. In the present study, we prioritized SNPs in close proximity to genome regions with biological importance for a given trait. We used a number of strategies to select variants that were then included in single and multiple kernel GBLUP models. Specifically, our sources of information were transcriptomics, GWAS, and immunity-related genes, with the ultimate goal to increase predictive accuracies for Cassava Brown Streak Disease (CBSD) severity. Results: We used single and multi-kernel GBLUP models with markers imputed to whole genome sequence level to accommodate various sources of biological information; fitting more than one kinship matrix allowed for differential weighting of the individual marker relationships. We applied these GBLUP approaches to CBSD phenotypes (i.e., root infection and leaf severity three and six months after planting) in a Ugandan Breeding Population (n = 955). Three means of exploiting an established RNAseq experiment of CBSD-infected cassava plants were used. Compared to the biology-agnostic GBLUP model, the informed multi-kernel models increased prediction accuracy only marginally (1.78% to 2.52%). Conclusions: Our results show that markers imputed to whole genome sequence level do not provide enhanced prediction accuracies compared to using standard GBS marker data in cassava. The use of transcriptomics data and other sources of biological information resulted in prediction accuracies that were nominally superior to those obtained from traditional prediction models.
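
For readers unfamiliar with the term, a multi-kernel GBLUP model is usually written in the generic form below. This is the standard textbook formulation, assuming one phenotypic record per individual; the abstract does not state the exact fixed effects or kernel definitions used in the study, so the symbols here are illustrative.

```latex
\mathbf{y} = \mathbf{1}\mu + \sum_{k=1}^{K} \mathbf{g}_k + \boldsymbol{\varepsilon},
\qquad
\mathbf{g}_k \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma^2_{g_k}\mathbf{K}_k\right),
\qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma^2_{\varepsilon}\mathbf{I}\right)
```

Here $\mathbf{y}$ is the vector of phenotypes, $\mu$ the overall mean, and each $\mathbf{K}_k$ a kinship matrix built from one marker subset (for example, SNPs near genes flagged by transcriptomics). Estimating a separate variance $\sigma^2_{g_k}$ per kernel is what allows the differential weighting of marker relationships mentioned above; with $K = 1$ the model reduces to single-kernel GBLUP.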


2019 ◽  
Author(s):  
Aaron Ayllon-Benitez ◽  
Romain Bourqui ◽  
Patricia Thébaut ◽  
Fleur Mougin

Abstract The revolution in new sequencing technologies, by strongly improving the production of omics data, is leading to a new understanding of the relations between genotype and phenotype. To interpret and analyze these massive data that are grouped according to a phenotype of interest, methods based on statistical enrichment have become a standard in biology. However, these methods synthesize the biological information by a priori selecting the over-represented terms and may suffer from focusing on the most studied genes that represent a limited coverage of annotated genes within the gene set. To address these limitations, we developed GSAn, a novel gene set annotation Web server that uses semantic similarity measures to reduce a priori Gene Ontology annotation terms. The originality of this new approach is to identify the best compromise between the number of retained annotation terms that has to be drastically reduced and the number of related genes that has to be as large as possible. Moreover, GSAn offers interactive visualization facilities dedicated to the multi-scale analysis of gene set annotations. GSAn is available at: https://gsan.labri.fr.


2018 ◽  
Author(s):  
Adrià Fernández-Torras ◽  
Miquel Duran-Frigola ◽  
Patrick Aloy

Abstract Background: The integration of large-scale drug sensitivity screens and genome-wide experiments is changing the field of pharmacogenomics, revealing molecular determinants of drug response without the need for previous knowledge about drug action. In particular, transcriptional signatures of drug sensitivity may guide drug repositioning, prioritize drug combinations and point to new therapeutic biomarkers. However, the inherent complexity of transcriptional signatures, with thousands of differentially expressed genes, makes them hard to interpret, thus giving poor mechanistic insights and hampering translation to clinics. Methods: To simplify drug signatures, we have developed a network-based methodology to identify functionally coherent gene modules. Our strategy starts with the calculation of drug-gene correlations and is followed by a pathway-oriented filtering and a network-diffusion analysis across the interactome. Results: We apply our approach to 189 drugs tested in 671 cancer cell lines and observe a connection between gene expression levels of the modules and mechanisms of action of the drugs. Further, we characterize multiple aspects of the modules, including their functional categories, tissue-specificity and prevalence in clinics. Finally, we prove the predictive capability of the modules and demonstrate how they can be used as gene sets in conventional enrichment analyses. Conclusions: Network biology strategies like module detection are able to digest the outcome of large-scale pharmacogenomic initiatives, thereby contributing to their interpretability and improving the characterization of the drugs screened.
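
The first step of the strategy, calculating drug-gene correlations, might look roughly like the sketch below. The abstract does not specify the correlation measure or any normalization, so plain Pearson correlation across cell lines is assumed here, and the gene names and values are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def drug_gene_correlations(expression, sensitivity):
    """Correlate each gene's expression across cell lines with one drug's sensitivity."""
    return {gene: pearson(values, sensitivity) for gene, values in expression.items()}

# Hypothetical toy data: four cell lines, three genes, one drug.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [4.0, 3.0, 2.0, 1.0],
    "GENE_C": [2.0, 2.0, 2.0, 2.0],
}
sensitivity = [0.1, 0.2, 0.3, 0.4]

corr = drug_gene_correlations(expression, sensitivity)
print(corr)
```

Genes with strongly positive or negative correlations would then be the candidates passed on to the pathway-oriented filtering and network-diffusion steps, which are not sketched here.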

