NormExpression: an R package to normalize gene expression data using evaluated methods

2018 ◽  
Author(s):  
Zhenfeng Wu ◽  
Weixiang Liu ◽  
Xiufeng Jin ◽  
Deshui Yu ◽  
Hua Wang ◽  
...  

Abstract Data normalization is a crucial step in gene expression analysis, as it ensures the validity of downstream analyses. Although many metrics have been designed to evaluate current normalization methods, the different metrics yield inconsistent results. In this study, we designed a new metric named Area Under normalized CV threshold Curve (AUCVC) and applied it with another metric, mSCC, to evaluate 14 commonly used normalization methods, achieving consistency in our evaluation results using both bulk RNA-seq and scRNA-seq data from the same library construction protocol. This consistency has validated the underlying theory that a successful normalization method simultaneously maximizes the number of uniform genes and minimizes the correlation between the expression profiles of gene pairs. This consistency can also be used to analyze the quality of gene expression data. The gene expression data, normalization methods and evaluation metrics used in this study have been included in an R package named NormExpression. NormExpression provides a framework and a fast and simple way for researchers to evaluate methods (particularly some data-driven methods or their own methods) and then select the best one for data normalization in gene expression analysis.
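The abstract names AUCVC but does not define it. The sketch below is one plausible reading of the name, not the NormExpression implementation: it computes each gene's coefficient of variation (CV) across samples, counts the fraction of genes whose CV falls at or below each threshold in [0, 1], and integrates that curve with the trapezoidal rule. The function names, the threshold grid and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean; treated as infinite when the mean is not positive."""
    m = mean(values)
    return pstdev(values) / m if m > 0 else float("inf")


def aucvc(expr, n_thresholds=101):
    """Hypothetical AUCVC: area under the (threshold, fraction of uniform genes) curve.

    expr: list of genes, each a list of normalized expression values across samples.
    Thresholds are assumed to run from 0 to 1 in equal steps.
    """
    cvs = [coefficient_of_variation(gene) for gene in expr]
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    # Fraction of genes counted as "uniform" (CV <= threshold) at each threshold.
    fractions = [sum(cv <= t for cv in cvs) / len(cvs) for t in thresholds]
    # Trapezoidal integration over the threshold axis.
    area = 0.0
    for i in range(1, n_thresholds):
        width = thresholds[i] - thresholds[i - 1]
        area += 0.5 * (fractions[i] + fractions[i - 1]) * width
    return area


if __name__ == "__main__":
    toy = [[5, 5, 5, 5], [0, 0, 0, 10]]  # one perfectly uniform gene, one highly variable gene
    print(aucvc(toy))
```

Under this reading, a normalization method that leaves more genes with low CV yields a larger area, which matches the abstract's statement that a successful method maximizes the number of uniform genes.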

Author(s):  
Soumya Raychaudhuri

The most interesting and challenging gene expression data sets to analyze are large multidimensional data sets that contain expression values for many genes across multiple conditions. In these data sets the use of scientific text can be particularly useful, since there are a myriad of genes examined under vastly different conditions, each of which may induce or repress expression of the same gene for different reasons. There is an enormous complexity to the data that we are examining: each gene is associated with dozens if not hundreds of expression values as well as multiple documents built up from vocabularies consisting of thousands of words. In Section 2.4 we reviewed common gene expression strategies, most of which revolve around defining groups of genes based on common profiles. A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present computational methods that leverage the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in gene expression data analysis offers an opportunity to incorporate background functional information about the genes when defining expression clusters. In Chapter 5 we saw how literature-based approaches could help in the analysis of single condition experiments. Here we will apply the strategies introduced in Chapter 6 to assess the coherence of groups of genes to enhance gene expression analysis approaches. The methods proposed here could, in fact, be applied to any multivariate genomics data type. The key concepts discussed in this chapter are listed in the frame box. We begin with a discussion of gene groups and their role in expression analysis; we briefly discuss strategies to assign keywords to groups and strategies to assess their functional coherence.
We apply functional coherence measures to gene expression analysis; for examples we focus on a yeast expression data set. We first demonstrate how functional coherence can be used to focus in on the key biologically relevant gene groups derived by clustering methods such as self-organizing maps and k-means clustering.


2019 ◽  
Vol 10 ◽  
Author(s):  
Zhenfeng Wu ◽  
Weixiang Liu ◽  
Xiufeng Jin ◽  
Haishuo Ji ◽  
Hua Wang ◽  
...  

Author(s):  
Vincent S. Tseng ◽  
Ching-Pin Kao

In recent years, clustering analysis has become a valuable tool for in-silico analysis of microarray or gene expression data. Although a number of clustering methods have been proposed, they have difficulty meeting the requirements of automation, high quality, and high efficiency at the same time. In this chapter, we discuss parameterless clustering techniques for gene expression analysis. We introduce two novel, parameterless and efficient clustering methods that are suited to the analysis of gene expression data. The unique feature of our methods is that they incorporate validation techniques into the clustering process so that high quality results can be obtained. Through experimental evaluation, these methods are shown to greatly outperform other clustering methods in terms of clustering quality, efficiency, and automation on both synthetic and real data sets.


2010 ◽  
Vol 9 ◽  
pp. CIN.S3851 ◽  
Author(s):  
Tsuyoshi Yoshida ◽  
Takumi Kobayashi ◽  
Masaya Itoda ◽  
Taika Muto ◽  
Ken Miyaguchi ◽  
...  

Background Colorectal cancer (CRC) is one of the most frequently occurring cancers in Japan, and thus a wide range of methods have been deployed to study the molecular mechanisms of CRC. In this study, we performed a comprehensive analysis of CRC, incorporating copy number aberration (CNA) and gene expression data. For the last four years, we have been collecting data from CRC cases and organizing the information as an “omics” study by integrating many kinds of analysis into a single comprehensive investigation. In our previous studies, we had experienced difficulty in finding genes related to CRC, as we observed higher noise levels in the expression data than in the data for other cancers. Because chromosomal aberrations are often observed in CRC, here, we have performed a combination of CNA analysis and expression analysis in order to identify some new genes responsible for CRC. This study was performed as part of the Clinical Omics Database Project at Tokyo Medical and Dental University. The purpose of this study was to investigate the mechanism of genetic instability in CRC by this combination of expression analysis and CNA, and to establish a new method for the diagnosis and treatment of CRC. Materials and methods Comprehensive gene expression analysis was performed on 79 CRC cases using an Affymetrix Gene Chip, and comprehensive CNA analysis was performed using an Affymetrix DNA Sty array. To avoid the contamination of cancer tissue with normal cells, laser micro-dissection was performed before DNA/RNA extraction. Data analysis was performed using original software written in the R language. Result We observed a high percentage of CNA in colorectal cancer, including copy number gains at 7, 8q, 13 and 20q, and copy number losses at 8p, 17p and 18. Gene expression analysis provided many candidates for CRC-related genes, but their association with CRC did not reach the level of statistical significance.
The combination of CNA and gene expression analysis, together with the clinical information, suggested UGT2B28, LOC440995, CXCL6, SULT1B1, RALBP1, TYMS, RAB12, RNMT, ARHGDIB, S100A2, ABHD2, OIT3 and ABHD12 as genes that are possibly associated with CRC. Some of these genes have already been reported as being related to CRC. TYMS has been reported as being associated with resistance to the anti-cancer drug 5-fluorouracil, and we observed a copy number increase for this gene. RALBP1, ARHGDIB and S100A2 have been reported as oncogenes, and we observed copy number increases in each. ARHGDIB has been reported as a metastasis-related gene, and our data also showed copy number increases of this gene in cases with metastasis. Conclusion The combination of CNA analysis and gene expression analysis was a more effective method for finding genes associated with the clinicopathological classification of CRC than either analysis alone. Using this combination of methods, we were able to detect genes that have already been associated with CRC. We also identified additional candidate genes that may be new markers or targets for this form of cancer.


2016 ◽  
Vol 16 (1) ◽  
pp. 48-73 ◽  
Author(s):  
Svenja Simon ◽  
Sebastian Mittelstädt ◽  
Bum Chul Kwon ◽  
Andreas Stoffel ◽  
Richard Landstorfer ◽  
...  

Biologists are keen to understand how processes in cells react to environmental changes. Differential gene expression analysis allows biologists to explore functions of genes with data generated from different environments. However, these data and analyses pose unique challenges since tasks are ill-defined, require implicit domain knowledge, comprise large volumes of data, and are, therefore, of exploratory nature. To investigate a scalable visualization-based solution, we conducted a design study with three biologists specialized in differential gene expression analysis. We stress our contributions in three aspects: first, we characterize the problem domain for exploring differential gene expression data and derive task abstractions and design requirements. Second, we investigate the design space and present an interactive visualization system, called VisExpress. Third, we evaluate the usefulness of VisExpress via a Pair Analytics study with real users and real data and report on insights that were gained by our experts with VisExpress.


2016 ◽  
Vol 3 (3) ◽  
pp. 51-59 ◽  
Author(s):  
Gerald Schaefer

Microarray studies and gene expression analysis have received significant attention over the last few years and provide many promising avenues towards the understanding of fundamental questions in biology and medicine. In this paper, the authors investigate the application of ant colony optimisation (ACO) based classification for the analysis of gene expression data. They employ cAnt-Miner, a variation of the classical Ant-Miner classifier, which is capable of interpreting the numerical gene expression data. Experimental results on well-known gene expression datasets show that the ant-based approach is capable of extracting a compact rule base while providing good classification performance.


Blood ◽  
2006 ◽  
Vol 108 (11) ◽  
pp. 4291-4291
Author(s):  
Patricia Alvarez ◽  
Pilar Saenz ◽  
David Arteta ◽  
Antonio Martiez ◽  
Miguel Pocovi ◽  
...  

Abstract High density microarrays (HDM) are powerful tools for simultaneously profiling the expression levels of thousands of genes. The application of this technology to the study of neoplastic hematological disorders has identified new disease subgroups not previously described and new prognostic markers. However, there is limited experience with gene expression studies using low density microarrays (LDM) in neoplastic hematological disorders. A gene expression analysis system based on an LDM containing 538 oligonucleotides has been developed. The whole technical process was optimized to improve the analysis of differential expression. We have analyzed mRNA from cell line cultures (Jurkat, U937) and whole blood samples from healthy subjects and patients with different hematological malignancies (HM) using this chip. A hierarchical clustering procedure applying Welch t-statistics with Bonferroni correction was used to analyze the gene expression data. The LDM generated a linear response over two orders of magnitude and CV values of less than 20% for hybridization and labeling replicates. This procedure detects 0.2 fmol of mRNA. We have found genes with statistically significant differences between Jurkat and U937 cell cultures, and among blood samples from 15 healthy donors, 59 lymphocyte leukemia patients, and 13 myeloid leukemia and myelodysplastic syndrome patients. A classification system based on gene expression data was constructed with an accuracy of 97% to predict healthy or lymphocyte leukemia status. To identify different subsets of patients in the B-CLL group, whole blood samples from 12 B-CLL patients were collected and classified, according to clinical and analytical criteria at the time of diagnosis, as “stable” (n=6) if disease stability was maintained for more than five years after the diagnosis and “progressive” (n=6) if the disease progressed less than one year after the diagnosis.
Applying the Welch statistical test without correction at p<0.05 yielded two lists of 29 and 19 probes differentially hybridized from VSN and quantile-robust normalized data, respectively. The supervised hierarchical clustering of B-CLL samples with the 29 statistically significant probes showed that samples grouped together based on their stable or progressive behavior. Eighteen probes were statistically significant in both normalized data sets. In order to confirm the expression data for the POU2F2, PSMB4, FCER2, LCP1, and ABCC5 genes, represented by 5 of the 18 statistically significant probes, real-time RT-PCR was performed. Three out of 5 genes (POU2F2, PSMB2, and FCER2) were over-expressed in B-CLL stable patients. Differences were statistically significant (P<0.05) and, therefore, results obtained from the chip for the POU2F2, PSMB2, and FCER2 genes were confirmed. In conclusion, a viable LDM for gene expression analysis and a simple procedure have been developed that are useful for the analysis of whole blood samples, without any cellular or sample manipulation prior to RNA extraction, with variability and reproducibility similar to those of other commercial HDMs. Applied to different samples, the system is capable of establishing significant differences in gene clusters and could be useful for clinical application in HM.
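The abstract relies on Welch's t-test, with and without Bonferroni correction, but gives no formulas. As a minimal sketch under stated assumptions (this is not the authors' pipeline; the function names and toy values are illustrative), the code below computes the Welch t-statistic and the Welch-Satterthwaite degrees of freedom for one probe, plus the per-test significance level that a Bonferroni correction would impose. The probe count of 538 is taken from the abstract. Converting the statistic to a p-value needs a Student t distribution, which the Python standard library does not provide, so that step is left out.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's unequal-variance t-statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean, group a
    vb = variance(b) / len(b)  # squared standard error of the mean, group b
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    # Toy intensities for one probe in two groups (made-up numbers, not study data).
    group_a = [1, 2, 3, 4, 5]
    group_b = [2, 4, 6, 8, 10]
    t, df = welch_t(group_a, group_b)
    print(f"t = {t:.4f}, df = {df:.4f}")
    # With 538 probes tested at a family-wise level of 0.05:
    print(f"per-probe alpha = {bonferroni_alpha(0.05, 538):.6f}")
```

The corrected threshold is far stricter than 0.05, which may explain why the stable versus progressive B-CLL comparison, with only six samples per group, is reported without correction.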


2021 ◽  
Vol 1 ◽  
Author(s):  
Mohamed Helmy ◽  
Rahul Agrawal ◽  
Javed Ali ◽  
Mohamed Soudy ◽  
Thuy Tien Bui ◽  
...  

Gene expression profiling techniques, such as DNA microarray and RNA-Sequencing, have had a significant impact on our understanding of biological systems. They contribute to almost all aspects of biomedical research, including studying developmental biology, host-parasite relationships, disease progression and drug effects. However, high-throughput data generation presents challenges for many wet-lab experimentalists who need to analyze and take full advantage of such rich and complex data. Here we present GeneCloudOmics, an easy-to-use web server for high-throughput gene expression analysis that extends the functionality of our previous ABioTrans with several new tools, including protein dataset analysis, and a web interface. GeneCloudOmics allows both microarray and RNA-Seq data analysis with a comprehensive range of data analytics tools in one package, a combination that no other current standalone software or web-based tool offers. In total, GeneCloudOmics provides the user access to 23 different data analytical and bioinformatics tasks including reads normalization, scatter plots, linear/non-linear correlations, PCA, clustering (hierarchical, k-means, t-SNE, SOM), differential expression analyses, pathway enrichments, evolutionary analyses, pathological analyses, and protein-protein interaction (PPI) identifications. Furthermore, GeneCloudOmics allows the direct import of gene expression data from the NCBI Gene Expression Omnibus database. The user can perform all tasks rapidly through an intuitive graphical user interface that overcomes the hassle of coding, installing tools/packages/libraries and dealing with operating system compatibility and version issues, complications that make data analysis tasks challenging for biologists. Thus, GeneCloudOmics is a one-stop open-source tool for gene expression data analysis and visualization. It is freely available at http://combio-sifbi.org/GeneCloudOmics.


Processes ◽  
2019 ◽  
Vol 7 (5) ◽  
pp. 301
Author(s):  
Muying Wang ◽  
Satoshi Fukuyama ◽  
Yoshihiro Kawaoka ◽  
Jason E. Shoemaker

Motivation: Immune cell dynamics is a critical factor of disease-associated pathology (immunopathology) that also impacts the levels of mRNAs in diseased tissue. Deconvolution algorithms attempt to infer cell quantities in a tissue/organ sample based on gene expression profiles and are often evaluated using artificial, non-complex samples. Their accuracy on estimating cell counts given temporal tissue gene expression data remains not well characterized and has never been characterized when using diseased lung. Further, how to remove the effects of cell migration on transcript counts to improve discovery of disease factors is an open question. Results: Four cell count inference (i.e., deconvolution) tools are evaluated using microarray data from influenza-infected lung sampled at several time points post-infection. The analysis finds that inferred cell quantities are accurate only for select cell types and there is a tendency for algorithms to have a good relative fit (R²) but a poor absolute fit (normalized mean squared error; NMSE), which suggests systemic biases exist. Nonetheless, using cell fraction estimates to adjust gene expression data, we show that genes associated with influenza virus replication and increased infection pathology are more likely to be identified as significant than when applying traditional statistical tests.
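The contrast between a good relative fit and a poor absolute fit can be illustrated with a toy example. The sketch below is not the paper's evaluation code. It assumes R² is the squared Pearson correlation between true and inferred cell fractions, and that NMSE is the mean squared error divided by the population variance of the true values. The paper may define either quantity differently, and the numbers are invented for illustration.

```python
from statistics import mean


def r_squared(truth, est):
    """Squared Pearson correlation: insensitive to a constant offset or rescaling."""
    mt, me = mean(truth), mean(est)
    cov = sum((t - mt) * (e - me) for t, e in zip(truth, est))
    var_t = sum((t - mt) ** 2 for t in truth)
    var_e = sum((e - me) ** 2 for e in est)
    return cov * cov / (var_t * var_e)


def nmse(truth, est):
    """Mean squared error divided by the population variance of the true values (assumed definition)."""
    mt = mean(truth)
    mse = mean([(t - e) ** 2 for t, e in zip(truth, est)])
    var_t = mean([(t - mt) ** 2 for t in truth])
    return mse / var_t


if __name__ == "__main__":
    true_fraction = [0.1, 0.2, 0.3, 0.4]      # made-up cell fractions over four time points
    inferred_fraction = [0.5, 0.6, 0.7, 0.8]  # same trend, shifted by a constant bias of 0.4
    print(r_squared(true_fraction, inferred_fraction))
    print(nmse(true_fraction, inferred_fraction))
```

An estimator that tracks the trend perfectly but carries a constant bias scores an R² near 1 while its NMSE is far above 1, which is the pattern the abstract interprets as evidence of bias in the deconvolution tools.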

