BIOMARKER DISCOVERY AND VISUALIZATION IN GENE EXPRESSION DATA WITH EFFICIENT GENERALIZED MATRIX APPROXIMATIONS

In most real-world gene expression data sets, there are often multiple sample classes with ordinals, which are categorized into the normal or diseased type. Traditional feature or attribute selection methods treat multiple classes equally without paying attention to the up/down regulation across the normal and diseased types of classes, while gene-specific selection methods consider the differential expression across normal and diseased samples but ignore the existence of multiple classes. In this paper, to improve biomarker discovery, we propose to make the best use of these two aspects: the differential expression (which can be viewed as domain knowledge about gene expression data) and the multiple classes (which can be viewed as a characteristic of the data set). We take both aspects into account simultaneously by employing 1-rank generalized matrix approximations (GMA). Our results show that GMA can not only improve the accuracy of classifying the samples, but also provide a visualization method to effectively analyze the gene expression data on both genes and samples. Based on the mechanism of matrix approximation, we further propose an algorithm, CBiomarker, to discover compact biomarkers by reducing redundancy.
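As a rough illustration of the rank-1 idea mentioned in the abstract, the sketch below computes an ordinary rank-1 approximation of a small genes-by-samples matrix by power iteration. It is not the authors' GMA or CBiomarker algorithm; the function name `rank1_approximation`, the toy matrix, and the reading of `u` and `v` as per-gene and per-sample scores are assumptions made for this example.

```python
import math


def rank1_approximation(matrix, iterations=100):
    """Return (sigma, u, v) such that sigma * u[i] * v[j] approximates matrix[i][j].

    Plain power iteration for the leading singular triplet; a generic
    technique, assumed here only as a stand-in for the paper's GMA.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # uniform unit-length starting vector
    u = [0.0] * n_rows
    sigma = 0.0
    for _ in range(iterations):
        # u <- A v, then normalize to unit length
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:
            return 0.0, u, v
        u = [x / norm_u for x in u]
        # v <- A^T u; its length is the current singular value estimate
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        if sigma == 0.0:
            return 0.0, u, v
        v = [x / sigma for x in v]
    return sigma, u, v


# Hypothetical toy data: 3 genes (rows) by 2 samples (columns), exactly rank 1.
toy_expression = [[2.0, 4.0], [1.0, 2.0], [3.0, 6.0]]
sigma, gene_scores, sample_scores = rank1_approximation(toy_expression)
reconstruction = [
    [sigma * g * s for s in sample_scores] for g in gene_scores
]
```

Under these assumptions, sorting genes by `gene_scores` and samples by `sample_scores` gives one simple way to order both axes of a heat map; whether that matches the visualization in the paper is not stated in the abstract.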

EFFICIENT GENERALIZED MATRIX APPROXIMATIONS FOR BIOMARKER DISCOVERY AND VISUALIZATION IN GENE EXPRESSION DATA

Computational Systems Bioinformatics ◽

10.1142/9781860947575_0020 ◽

2006 ◽

Cited By ~ 3

Author(s):

Wenyuan Li ◽

Yanxiong Peng ◽

Hung-Chung Huang ◽

Ying Liu

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Biomarker Discovery ◽

Expression Data ◽

Generalized Matrix ◽

Matrix Approximations

A fuzzy logic approach to analyzing gene expression data

Physiological Genomics ◽

10.1152/physiolgenomics.2000.3.1.9 ◽

2000 ◽

Vol 3 (1) ◽

pp. 9-15 ◽

Cited By ~ 138

Author(s):

PETER J. WOOLF ◽

YIXIN WANG

Keyword(s):

Gene Expression ◽

Fuzzy Logic ◽

Gene Expression Data ◽

Expression Data ◽

Heuristic Rules ◽

Yeast Gene ◽

Data Set ◽

Fuzzy Logic Approach ◽

Logic Approach ◽

Novel Algorithm

Woolf, Peter J., and Yixin Wang. A fuzzy logic approach to analyzing gene expression data. Physiol Genomics 3: 9–15, 2000.—We have developed a novel algorithm for analyzing gene expression data. This algorithm uses fuzzy logic to transform expression values into qualitative descriptors that can be evaluated by using a set of heuristic rules. In our tests we designed a model to find triplets of activators, repressors, and targets in a yeast gene expression data set. For the conditions tested, the predictions made by the algorithm agree well with experimental data in the literature. The algorithm can also assist in determining the function of uncharacterized proteins and is able to detect a substantially larger number of transcription factors than could be found at random. This technology extends current techniques such as clustering in that it allows the user to generate a connected network of genes using only expression data.
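The fuzzy-logic step described above can be sketched as follows. This is a minimal illustration, not the published algorithm: the triangular membership functions, the single heuristic rule ("the target is high when the activator is high and the repressor is low"), the use of `min` as fuzzy AND, and all function names are assumptions chosen for the example. Expression values are assumed to be normalized to the range 0 to 1.

```python
def fuzzify(value):
    """Map a normalized expression value to memberships in low/medium/high."""
    x = min(max(value, 0.0), 1.0)  # clamp to [0, 1]
    low = max(0.0, 1.0 - 2.0 * x)
    high = max(0.0, 2.0 * x - 1.0)
    medium = 1.0 - low - high
    return {"low": low, "medium": medium, "high": high}


def rule_target_high(activator, repressor):
    """Degree to which the assumed rule predicts a highly expressed target."""
    a = fuzzify(activator)
    r = fuzzify(repressor)
    return min(a["high"], r["low"])  # min acts as fuzzy AND


def triplet_error(activator, repressor, target):
    """Mean absolute gap between predicted and observed 'high' membership.

    Each argument is a list of normalized expression values, one per
    condition. Lower values mean the triplet fits the rule better.
    """
    gaps = [
        abs(rule_target_high(a, r) - fuzzify(t)["high"])
        for a, r, t in zip(activator, repressor, target)
    ]
    return sum(gaps) / len(gaps)


# Hypothetical two-condition profiles for one candidate triplet.
example_error = triplet_error(
    activator=[1.0, 0.0], repressor=[0.0, 0.0], target=[1.0, 0.0]
)
```

Ranking all candidate triplets by such an error score is one plausible way to search for activator, repressor, and target combinations; the scoring used in the paper may differ.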

Inference of Genetic Networks From Time-Series and Static Gene Expression Data: Combining a Random-Forest-Based Inference Method With Feature Selection Methods

Frontiers in Genetics ◽

10.3389/fgene.2020.595912 ◽

2020 ◽

Vol 11 ◽

Author(s):

Shuhei Kimura ◽

Ryo Fukutomi ◽

Masato Tokuhisa ◽

Mariko Okada

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Random Forest ◽

Gene Expression Data ◽

Computational Cost ◽

Expression Data ◽

Selection Methods ◽

Inference Method ◽

Combined Application ◽

Inference Methods

Several researchers have focused on random-forest-based inference methods because of their excellent performance. Some of these inference methods also have a useful ability to analyze both time-series and static gene expression data. However, they are only of use in ranking all of the candidate regulations by assigning them confidence values. None have been capable of detecting the regulations that actually affect a gene of interest. In this study, we propose a method to remove unpromising candidate regulations by combining the random-forest-based inference method with a series of feature selection methods. In addition to detecting unpromising regulations, our proposed method uses outputs from the feature selection methods to adjust the confidence values of all of the candidate regulations that have been computed by the random-forest-based inference method. Numerical experiments showed that the combined application with the feature selection methods improved the performance of the random-forest-based inference method on 99 of the 100 trials performed on the artificial problems. However, the improvement tends to be small, since our combined method succeeded in removing only 19% of the candidate regulations at most. The combined application with the feature selection methods moreover makes the computational cost higher. While a bigger improvement at a lower computational cost would be ideal, we see no impediments to our investigation, given that our aim is to extract as much useful information as possible from a limited amount of gene expression data.

Incorporating Pathway Information into Feature Selection towards Better Performed Gene Signatures

BioMed Research International ◽

10.1155/2019/2497509 ◽

2019 ◽

Vol 2019 ◽

pp. 1-12 ◽

Cited By ~ 1

Author(s):

Suyan Tian ◽

Chi Wang ◽

Bing Wang

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Gene Selection ◽

Selection Process ◽

Biological Knowledge ◽

Expression Data ◽

Selection Methods ◽

Its Gene ◽

Active Research

To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. If the biological knowledge contained within these pathways is taken into account, the resulting method is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart, in which no biological knowledge is considered. In this article, pathway-based feature selection is first divided into three major categories, namely, pathway-level selection, bilevel selection, and pathway-guided gene selection. With bilevel selection methods regarded as a special case of the pathway-guided gene selection process, we discuss pathway-guided gene selection methods in detail, along with the importance of penalization in such methods. Last, we point out the potential uses of pathway-guided gene selection in one active research avenue, namely, the analysis of longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable.

Searching for master regulators of transcription in a human gene expression data set

BMC Proceedings ◽

10.1186/1753-6561-1-s1-s81 ◽

2007 ◽

Vol 1 (S1) ◽

Cited By ~ 2

Author(s):

Alfonso Buil ◽

Alexandre Perera-Lluna ◽

Ramon Souto ◽

Juan M Peralta ◽

Laura Almasy ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Human Gene ◽

Expression Data ◽

Data Set ◽

Master Regulators ◽

Human Gene Expression

A Survey on Hybrid Feature Selection Methods in Microarray Gene Expression Data for Cancer Classification

IEEE Access ◽

10.1109/access.2019.2922987 ◽

2019 ◽

Vol 7 ◽

pp. 78533-78548 ◽

Cited By ~ 21

Author(s):

Nada Almugren ◽

Hala Alshamlan

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

Cancer Classification ◽

Expression Data ◽

Selection Methods ◽

Microarray Gene Expression ◽

Microarray Gene

CLASSIFYING TEMPORAL MICROARRAY DATA BY SELECTING INFORMATIVE GENES

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720013410060 ◽

2013 ◽

Vol 11 (03) ◽

pp. 1341006

Author(s):

QIANG LOU ◽

ZORAN OBRADOVIC

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Microarray Data ◽

Data Sets ◽

Temporal Data ◽

Expression Data ◽

Selection Methods ◽

Temporal Gene Expression ◽

Single Matrix

To predict an individual's health status more accurately, clinical applications often require analysis of high-dimensional gene expression data that varies with time. A major challenge in predicting from such temporal microarray data is that the number of biomarkers used as features is typically much larger than the number of labeled subjects. One way to address this challenge is to perform feature selection as a preprocessing step and then apply a classification method to the selected features. However, traditional feature selection methods cannot handle multivariate temporal data without first applying techniques that flatten the temporal data into a single matrix. In this study, a feature selection filter that can directly select informative features from temporal gene expression data is proposed. In our approach, we measure the distance between multivariate temporal data from two subjects. Based on this distance, we define the objective function of temporal-margin-based feature selection to maximize each subject's temporal margin in its own relevant subspace. The experimental results on synthetic data and two real flu data sets provide evidence that our method outperforms the alternatives, which flatten the temporal data in advance.

Defining Immune Response Signatures in DLBCL As Potential Predictive Biomarkers for Outcome to Immunotherapy

Blood ◽

10.1182/blood.v126.23.2663.2663 ◽

2015 ◽

Vol 126 (23) ◽

pp. 2663-2663

Author(s):

Matthew A Care ◽

Stephen M Thirdborough ◽

Andrew J Davies ◽

Peter W.M. Johnson ◽

Andrew Jack ◽

...

Keyword(s):

Gene Expression ◽

Immune Response ◽

Network Analysis ◽

Gene Expression Data ◽

Research Funding ◽

Data Sets ◽

Expression Data ◽

Data Set ◽

Gene Correlation ◽

Cancer Types

Purpose: To assess whether comparative gene network analysis can reveal characteristic immune response signatures that predict clinical response in diffuse large B-cell lymphoma (DLBCL).
Background: The wealth of available gene expression data sets for DLBCL and other cancer types provides a resource to define recurrent pathological processes at the level of gene expression and gene correlation neighbourhoods. This is of particular relevance in the context of cancer immune responses, where convergence onto common patterns may drive shared gene expression profiles. Where existing and novel immunotherapies harness the immune response for therapeutic benefit, such responses may provide predictive biomarkers.
Methods: We independently analysed publicly available DLBCL gene expression data sets and a wide compendium of gene expression data from diverse cancer types, and then asked whether common elements of cancer host response could be identified from the resulting networks. Using 10 DLBCL gene expression data sets, encompassing 2030 cases, we established pairwise gene correlation matrices per data set, which were merged to generate median correlations of gene pairs across all data sets. Gene network analysis and unsupervised clustering were then applied to define global representations of DLBCL gene expression neighbourhoods. In parallel, a diverse range of solid and lymphoid malignancies, including breast, colorectal, oesophageal, head and neck, non-small cell lung, prostate, and pancreatic cancer, Hodgkin lymphoma, follicular lymphoma and DLBCL, were independently analysed using an orthogonal weighted gene correlation network analysis (WGCNA) of gene expression data sets, from which correlated modules across diverse cancer types were identified. The biology of the resulting gene neighbourhoods was assessed by signature and ontology enrichment, and the overlap between gene correlation neighbourhoods and WGCNA-derived modules associated with immune/host responses was analysed.
Results: Amongst DLBCL data, we identified distinct gene correlation neighbourhoods associated with the immune response. These included elements of IFN-polarised responses, core T-cell, and cytotoxic signatures as well as distinct macrophage responses. Neighbourhoods linked to macrophages separated CD163 from CD68 and CD14. In the WGCNA analysis of diverse cancer types, clusters corresponding to these immune response neighbourhoods were independently identified, including a highly similar cluster related to CD163. The overlapping CD163 clusters in both analyses were linked to diverse Fc receptors, complement pathway components and patterns of scavenger receptors potentially linked to alternative macrophage activation. The relationship between the CD163 macrophage gene expression cluster and outcome was tested in DLBCL data sets, identifying a poor response in CD163-cluster-high patients, which reached statistical significance in one data set (GSE10846). Notably, the effect of the CD163-associated gene neighbourhood, which correlates with poor outcome after rituximab-containing immunochemotherapy, is distinct from the effect of IFNG-STAT1-IRF1 polarised cytotoxic responses. The latter represents the predominant immune response pattern separating cell of origin unclassifiable (Type-III) DLBCL from either ABC or GCB DLBCL subsets, and is associated with a trend toward positive outcome.
Conclusion: Comparative gene expression network analysis identifies common immune response signatures shared between DLBCL and other cancer types. Gene expression clusters linked to CD163 macrophage responses and IFNG-STAT1-IRF1 polarised cytotoxic responses are common patterns with apparently divergent outcome associations.
Disclosures: Davies: CTI: Honoraria; Gilead: Consultancy, Honoraria, Research Funding; Mundipharma: Honoraria, Research Funding; Bayer: Research Funding; Takeda: Honoraria, Research Funding; Janssen: Honoraria, Research Funding; Roche: Honoraria, Research Funding; GSK: Research Funding; Pfizer: Honoraria; Celgene: Honoraria, Research Funding. Jack: Janssen: Research Funding.
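The merging step in the methods above (one pairwise gene correlation matrix per data set, then the median correlation of each gene pair across data sets) can be sketched as follows. This is a minimal illustration under stated assumptions: Pearson correlation is assumed because the abstract does not name the correlation measure, the function names are hypothetical, and the gene names and values are made-up toy data rather than anything from the study.

```python
import math
from itertools import combinations
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length lists (undefined for constant input)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(dataset):
    """dataset maps gene name -> expression values; returns {(gene1, gene2): r}."""
    return {
        (g1, g2): pearson(dataset[g1], dataset[g2])
        for g1, g2 in combinations(sorted(dataset), 2)
    }


def median_correlations(datasets):
    """Median correlation per gene pair, over pairs present in every data set."""
    per_dataset = [correlation_matrix(d) for d in datasets]
    shared_pairs = set(per_dataset[0]).intersection(*per_dataset[1:])
    return {
        pair: median(matrix[pair] for matrix in per_dataset)
        for pair in shared_pairs
    }


# Hypothetical toy input: three data sets measuring the same two genes.
toy_datasets = [
    {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [2.0, 4.0, 6.0]},
    {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [3.0, 2.0, 1.0]},
    {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [1.0, 3.0, 5.0]},
]
merged = median_correlations(toy_datasets)
```

Taking the median rather than the mean keeps a single discordant data set (the second one here) from dominating the merged value, which is one plausible reason for the choice; the abstract does not state the rationale.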

Prediction of Optimal Cytoreductive Surgery of Serous Ovarian Cancer With Gene Expression Data

International Journal of Gynecological Cancer ◽

10.1097/igc.0000000000000449 ◽

2015 ◽

Vol 25 (6) ◽

pp. 1000-1009 ◽

Cited By ~ 4

Author(s):

Reem Abdallah ◽

Hye Sook Chon ◽

Nadim Bou Zgheib ◽

Douglas C. Marchion ◽

Robert M. Wenham ◽

...

Keyword(s):

Gene Expression ◽

Ovarian Cancer ◽

Gene Expression Data ◽

Expression Analysis ◽

Cytoreductive Surgery ◽

Tumor Biology ◽

Gene Expression Signature ◽

Optimal Cytoreduction ◽

Expression Data ◽

Data Set

Objectives: Cytoreductive surgery is the cornerstone of ovarian cancer (OVCA) treatment. Detractors of initial maximal surgical effort argue that aggressive tumor biology will dictate survival, not the surgical effort. We investigated the role of biology in achieving optimal cytoreduction in serous OVCA using microarray gene expression analysis.
Methods: For the initial model, we used a gene expression signature from a microarray expression analysis of 124 women with serous OVCA, defining optimal cytoreduction as removal of all disease greater than 1 cm (with 64 women having optimal and 60 suboptimal cytoreduction). We then applied this model to 2 independent data sets: the Australian Ovarian Cancer Study (AOCS; 190 samples) and The Cancer Genome Atlas (TCGA; 468 samples). We performed a second analysis, defining optimal cytoreduction as removal of all disease to microscopic residual, using data from AOCS to create the gene signature and validating results in the TCGA data set.
Results: Of the 12,718 genes included in the initial analysis, 58 predicted accuracy of cytoreductive surgery 69% of the time (P = 0.005). The performance of this classifier, measured by the area under the receiver operating characteristic curve, was 73%. When applied to TCGA and AOCS, accuracy was 56% (P = 0.16) and 62% (P = 0.01), respectively, with performance at 57% and 65%, respectively. In the second analysis, 220 genes predicted accuracy of cytoreductive surgery in the AOCS set 74% of the time, with performance of 73%. When these results were validated in the TCGA set, accuracy was 57% (P = 0.31) and performance was at 62%.
Conclusion: Gene expression data, used as a proxy of tumor biology, do not accurately or consistently predict the ability to perform optimal cytoreductive surgery. Other factors, including surgical effort, may also explain part of the model. Additional studies integrating more biological and clinical data may improve the prediction model.
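The two measures reported above, accuracy and the area under the receiver operating characteristic curve (AUC), can be computed as in the sketch below. It is a generic illustration with made-up labels and scores, not a reconstruction of the study's classifier; the function names and the 0.5 decision threshold are assumptions for the example.

```python
def accuracy(labels, predictions):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(1 for l, p in zip(labels, predictions) if l == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = optimal cytoreduction, 0 = suboptimal.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.2]  # classifier scores, higher = more likely optimal
toy_predictions = [1 if s >= 0.5 else 0 for s in toy_scores]  # assumed threshold

toy_accuracy = accuracy(toy_labels, toy_predictions)
toy_auc = roc_auc(toy_labels, toy_scores)
```

The toy example shows why the two numbers can differ: accuracy depends on the chosen threshold, while AUC summarizes how well the scores rank positives above negatives across all thresholds.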
