Application of Gene Shaving and Mixture Models to Cluster Microarray Gene Expression Data

2007 ◽  
Vol 5 ◽  
pp. 117693510700500
Author(s):  
K-A. Do ◽  
G.J. McLachlan ◽  
R. Bean ◽  
S. Wen

Researchers are frequently faced with the analysis of microarray data of a relatively large number of genes using a small number of tissue samples. We examine the application of two statistical methods for clustering such microarray expression data: EMMIX-GENE and GeneClust. EMMIX-GENE is a mixture-model based clustering approach, designed primarily to cluster tissue samples on the basis of the genes. GeneClust is an implementation of the gene shaving methodology, motivated by research to identify distinct sets of genes for which variation in expression could be related to a biological property of the tissue samples. We illustrate the use of these two methods in the analysis of Affymetrix oligonucleotide arrays of well-known data sets from colon tissue samples with and without tumors, and of tumor tissue samples from patients with leukemia. Although the two approaches have been developed from different perspectives, the results demonstrate a clear correspondence between gene clusters produced by GeneClust and EMMIX-GENE for the colon tissue data. It is demonstrated, for the case of ribosomal proteins and smooth muscle genes in the colon data set, that both methods can classify genes into co-regulated families. It is further demonstrated that tissue types (tumor and normal) can be separated on the basis of subtle distributed patterns of genes. Application to the leukemia tissue data produces a division of tissues corresponding closely to the external classification, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), for both methods. We also identify genes specific for the subgroup of ALL-Tcell samples.
Overall, we find that the gene shaving method produces gene clusters at great speed; allows variable cluster sizes and can incorporate partial or full supervision; and finds clusters of genes in which the gene expression varies greatly over the tissue samples while maintaining a high level of coherence between the gene expression profiles. The intent of the EMMIX-GENE method is to cluster the tissue samples. It performs a filtering step that results in a subset of relevant genes, followed by gene clustering, and then tissue clustering, and is favorable in its accuracy of ranking the clusters produced.
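
The mixture-model idea behind clustering tissue samples can be illustrated with a deliberately simplified sketch. It is not EMMIX-GENE itself: it assumes a single gene, two normal components and a fixed number of EM iterations, and the function names and toy expression values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_mixture(values, n_iter=100):
    """Fit a two-component univariate normal mixture by EM (illustrative only)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Initialise the two means from the lower and upper halves of the data.
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    overall_mean = sum(values) / len(values)
    overall_var = sum((v - overall_mean) ** 2 for v in values) / len(values)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for v in values:
            dens = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update mixing weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var_k = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var_k, 1e-6)  # floor avoids a degenerate component
    # Assign each sample to the component with the larger posterior probability.
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, labels


# Hypothetical expression values of one gene across eight tissue samples.
expression = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
means, labels = fit_two_component_mixture(expression)
```

In this toy case the low-expression and high-expression samples fall into separate components. A practical analysis would work with many genes at once and would need a principled choice of the number of components.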

2019 ◽  
Author(s):  
Dan MacLean

Abstract Gene regulatory networks that control gene expression are widely studied, yet the interactions that make them up are difficult to predict from high-throughput data. Deep learning methods such as convolutional neural networks can perform surprisingly good classifications on a variety of data types, and the matrix-like gene expression profiles would seem to be ideal input data for deep learning approaches. In this short study I compiled training sets of expression data using the Arabidopsis AtGenExpress global stress expression data set and known transcription factor-target interactions from the Arabidopsis PLACE database. I built and optimised convolutional neural networks with a best model providing 95% accuracy of classification on a held-out validation set. Investigation of the activations within this model revealed that classification was based on positive correlation of expression profiles in short sections. This result shows that a convolutional neural network can be used to make classifications and reveal the basis of those classifications for gene expression data sets, indicating that a convolutional neural network is a useful and interpretable tool for exploratory classification of biological data. The final model is available for download and as a web application.


Blood ◽  
2013 ◽  
Vol 122 (21) ◽  
pp. 1390-1390
Author(s):  
Jitsuda Sitthi-Amorn ◽  
Betty Herrington ◽  
Gail Megason ◽  
Jeanette Pullen ◽  
Catherine Gordon ◽  
...  

Abstract Introduction Despite advances in diagnosis and treatment, B-precursor acute lymphoblastic leukemia (B-ALL) remains the most common childhood cancer and one of the leading causes of cancer-related death in children and adolescents. Although B-ALL is highly curable, approximately 10 - 20% of children diagnosed with B-ALL still do not respond to the current treatment protocols. Minimal residual disease (MRD) at the end of induction of remission is strongly associated with prognosis. Therefore there is an urgent need to understand the molecular mechanisms underpinning MRD and to identify biomarkers for the development of novel and more effective therapeutic strategies. This project was undertaken to determine whether molecular perturbation in patients with positive MRD at day 46 differs from those with negative MRD in different subtypes of B-ALL and to identify biological pathways dysregulated. We hypothesized that gene expression profiles differ significantly between patients with positive MRD at day 46 and patients with negative MRD. Methods We analyzed publicly available gene expression data derived from samples obtained from 189 patients with B-ALL (47 with positive MRD at day 46 and 142 with negative MRD). The data was downloaded from the NCBI’s Gene Expression Omnibus (GEO) database under accession number GSE33315. Patients were classified into seven subtypes of B-ALL which are hyperdiploid, ETV6-RUNX1, MLL rearrangement, hypodiploid, BCR-ABL1, TCF3-PBX1 and others (no detectable recurring genetic abnormalities). Samples from patients with BCR-ABL1 were excluded due to a different prognosis and treatment approach. Patients with TCF3-PBX1 were excluded due to the small sample size; leaving 165 patients in the analysis (35 with positive MRD at day 46 and 130 with negative MRD). We analyzed gene expression data using both supervised and unsupervised analysis. 
Supervised analysis was performed between patients with positive MRD and negative MRD for each subtype of B-ALL. Unsupervised analysis using hierarchical clustering was performed on significantly differentially expressed genes (P < 0.005) to identify functionally related genes with similar patterns of expression profiles. Pathway analysis was performed using the Ingenuity Pathways Analysis (IPA) system to identify biological pathways that are dysregulated in response to positive MRD in different subtypes of B-ALL. Result Comparison of gene expression profiles between positive MRD and negative MRD revealed significantly differentially expressed genes between the two groups. The numbers of significantly (P < 0.005) differentially expressed genes for hyperdiploid, ETV6-RUNX1, MLL rearrangement, hypodiploid and others were 93, 82, 87, 140 and 289 genes, respectively. The identified genes included BCL2, BECN1, CBFB, IKZF1, PAX5, SH2B3 and TOX, which are known to be associated with B-ALL. Unsupervised analysis using hierarchical clustering and GO analysis revealed similarity in patterns of gene expression within subtypes of B-ALL and functional relationships among the identified genes. The identified genes included genes involved in cell death and survival, cellular development and DNA replication, recombination, and repair. Network and pathway analysis revealed multi-gene regulatory networks and key biological pathways including granzyme B signaling, TCA cycle II and B cell receptor signaling. Pathway analysis also revealed upstream regulators including RB1, CDKN2A and TP53, which have been reported to be involved in the hypodiploid subtype, a subtype characterized by poorer prognosis. Conclusion Although the sample size is small, our analysis demonstrates that molecular perturbation significantly differs between pediatric B-ALL patients with positive MRD and those with negative MRD, and that these differences are subtype-specific.
The results further demonstrate that biological pathways are dysregulated in response to MRD status and that use of gene expression analysis has the promise to stratify patients on the basis of MRD status and to identify potential biomarkers. Disclosures: No relevant conflicts of interest to declare.


2010 ◽  
Vol 28 (15) ◽  
pp. 2529-2537 ◽  
Author(s):  
Torsten Haferlach ◽  
Alexander Kohlmann ◽  
Lothar Wieczorek ◽  
Giuseppe Basso ◽  
Geertruy Te Kronnie ◽  
...  

Purpose The Microarray Innovations in Leukemia study assessed the clinical utility of gene expression profiling as a single test to subtype leukemias into conventional categories of myeloid and lymphoid malignancies. Methods The investigation was performed in 11 laboratories across three continents and included 3,334 patients. An exploratory retrospective stage I study was designed for biomarker discovery and generated whole-genome expression profiles from 2,143 patients with leukemias and myelodysplastic syndromes. The gene expression profiling–based diagnostic accuracy was further validated in a prospective second study stage of an independent cohort of 1,191 patients. Results On the basis of 2,096 samples, the stage I study achieved 92.2% classification accuracy for all 18 distinct classes investigated (median specificity of 99.7%). In a second cohort of 1,152 prospectively collected patients, a classification scheme reached 95.6% median sensitivity and 99.8% median specificity for 14 standard subtypes of acute leukemia (eight acute lymphoblastic leukemia and six acute myeloid leukemia classes, n = 693). In 29 (57%) of 51 discrepant cases, the microarray results had outperformed routine diagnostic methods. Conclusion Gene expression profiling is a robust technology for the diagnosis of hematologic malignancies with high accuracy. It may complement current diagnostic algorithms and could offer a reliable platform for patients who lack access to today's state-of-the-art diagnostic work-up. Our comprehensive gene expression data set will be submitted to the public domain to foster research focusing on the molecular understanding of leukemias.


Cephalalgia ◽  
2016 ◽  
Vol 36 (7) ◽  
pp. 669-678 ◽  
Author(s):  
Zachary Gerring ◽  
Astrid J Rodriguez-Acevedo ◽  
Joseph E Powell ◽  
Lyn R Griffiths ◽  
Grant W Montgomery ◽  
...  

Background Global gene expression analysis may be used to obtain insights into the functional processes underlying migraine. However, there is a shortage of high-quality post-mortem brain tissue samples for genetic analysis. One approach is to use a more accessible tissue as a surrogate, such as peripheral blood. Purpose Discuss the benefits and caveats of blood genomic profiling in migraine and its potential application in the development of biomarkers of migraine susceptibility and outcome. Demonstrate the utility of blood-based expression profiles in migraine by analysing pilot Illumina HT-12 expression data from 76 (38 case, 38 control) whole-blood samples. Conclusion Current evidence suggests peripheral blood is a biologically valid substrate for genetic studies of migraine, and may be used to identify biomarkers and therapeutic pathways. Pilot blood gene expression data confirm that expression profiles significantly differ between migraine case and non-migraine control individuals.


2009 ◽  
Vol 3 ◽  
pp. BBI.S2908 ◽  
Author(s):  
Mihir S. Sewak ◽  
Narender P. Reddy ◽  
Zhong-Hui Duan

Analysis of gene expression data provides an objective and efficient technique for sub-classification of leukemia. The purpose of the present study was to design a committee neural networks based classification systems to subcategorize leukemia gene expression data. In the study, a binary classification system was considered to differentiate acute lymphoblastic leukemia from acute myeloid leukemia. A ternary classification system which classifies leukemia expression data into three subclasses including B-cell acute lymphoblastic leukemia, T-cell acute lymphoblastic leukemia and acute myeloid leukemia was also developed. In each classification system gene expression profiles of leukemia patients were first subjected to a sequence of simple preprocessing steps. This resulted in filtering out approximately 95 percent of the non-informative genes. The remaining 5 percent of the informative genes were used to train a set of artificial neural networks with different parameters and architectures. The networks that gave the best results during initial testing were recruited into a committee. The committee decision was by majority voting. The committee neural network system was later evaluated using data not used in training. The binary classification system classified microarray gene expression profiles into two categories with 100 percent accuracy and the ternary system correctly predicted the three subclasses of leukemia in over 97 percent of the cases.
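
The committee decision by majority voting described above can be sketched as follows. The function name and the example labels are hypothetical, and the individual networks are represented only by their predicted labels.

```python
from collections import Counter


def committee_decision(member_predictions):
    """Return the class label predicted by the largest number of committee members.

    Ties are broken in favour of the label that appears first in the input;
    using an odd number of members avoids ties in the binary case.
    """
    counts = Counter(member_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical predictions from three trained networks for one patient sample.
decision = committee_decision(["ALL", "AML", "ALL"])
```

The tie-breaking rule here is an assumption of the sketch; the abstract does not say how ties are handled in the ternary system.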


2010 ◽  
Vol 7 (3) ◽  
Author(s):  
Wim De Mulder ◽  
Martin Kuiper ◽  
René Boel

Summary Clustering is an important approach in the analysis of biological data, and often a first step to identify interesting patterns of coexpression in gene expression data. Because of the high complexity and diversity of gene expression data, many genes cannot be easily assigned to a cluster, but even if the dissimilarity of these genes with all other gene groups is large, they will finally be forced to become a member of a cluster. In this paper we show how to detect such elements, called unstable elements. We have developed an approach for iterative clustering algorithms in which unstable elements are deleted, making the iterative algorithm less dependent on initial centers. Although the approach is unsupervised, it is less likely that the clusters into which the reduced data set is subdivided contain false positives. This clustering yields a more differentiated approach for biological data, since the cluster analysis is divided into two parts: the pruned data set is divided into highly consistent clusters in an unsupervised way, and the removed, unstable elements for which no meaningful cluster exists in unsupervised terms can be given a cluster with the use of biological knowledge and information about the likelihood of cluster membership. We illustrate our framework on both an artificial and a real biological data set.
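
One possible way to flag elements whose cluster membership depends on the initial centers is sketched below. It is an assumption of this sketch, not the authors' published procedure: it runs a one-dimensional k-means from several sets of initial centers and scores each element by the fraction of other elements with which its co-clustering changes between runs. All names and data are hypothetical.

```python
def kmeans_1d(values, initial_centers, n_iter=50):
    """Plain k-means on scalar values, started from the given centers."""
    centers = list(initial_centers)

    def assign():
        # Index of the nearest center for every value.
        return [min(range(len(centers)), key=lambda k: abs(v - centers[k]))
                for v in values]

    for _ in range(n_iter):
        labels = assign()
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return assign()


def instability_scores(values, initial_center_sets):
    """Fraction of partners whose co-clustering with each element varies across runs."""
    runs = [kmeans_1d(values, centers) for centers in initial_center_sets]
    n = len(values)
    scores = []
    for i in range(n):
        inconsistent = 0
        for j in range(n):
            if i == j:
                continue
            together = [labels[i] == labels[j] for labels in runs]
            # A pair is inconsistent if it is co-clustered in some runs but not all.
            if any(together) and not all(together):
                inconsistent += 1
        scores.append(inconsistent / (n - 1))
    return scores


# Toy data: two tight groups and one value (2.5) lying between them.
values = [0.0, 0.2, 0.4, 2.5, 5.0, 5.2, 5.4]
scores = instability_scores(values, [(0.0, 5.0), (0.2, 3.0)])
unstable = [i for i, s in enumerate(scores) if s > 0.5]
```

With these two starting points the intermediate value joins a different group in each run, so it receives the highest score, while the members of the tight groups are affected only through their pairing with it. The 0.5 threshold is arbitrary.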


2004 ◽  
Vol 43 (01) ◽  
pp. 4-8 ◽  
Author(s):  
A. Luchini ◽  
C. Di Bello ◽  
S. Bicciato

Summary Objectives: High-throughput technologies are radically boosting the understanding of living systems, thus creating enormous opportunities to elucidate the biological processes of cells in different physiological states. In particular, the application of DNA micro-arrays to monitor expression profiles from tumor cells is improving cancer analysis to levels that classical methods have been unable to reach. However, molecular diagnostics based on expression profiling requires addressing computational issues such as the overwhelming number of variables and the complex, multi-class nature of tumor samples. Thus, the objective of the present research has been the development of a computational procedure for feature extraction and classification of gene expression data. Methods: The Soft Independent Modeling of Class Analogy (SIMCA) approach has been implemented in a data mining scheme, which allows the identification of those genes that are most likely to confer robust and accurate classification of samples from multiple tumor types. Results: The proposed method has been tested on two different microarray data sets, namely Golub’s analysis of acute human leukemia [1] and the small round blue cell tumors study presented by Khan et al. [2]. The identified features represent a rational and dimensionally reduced base for understanding the biology of diseases, defining targets of therapeutic intervention, and developing diagnostic tools for classification of pathological states. Conclusions: The analysis of the SIMCA model residuals allows the identification of specific phenotype markers. At the same time, the class analogy approach provides the assignment to multiple classes, such as different pathological conditions or tissue samples, for previously unseen instances.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Océane Cassan ◽  
Sophie Lèbre ◽  
Antoine Martin

Abstract Background High-throughput transcriptomic datasets are often examined to discover new actors and regulators of a biological response. To this end, graphical interfaces have been developed and allow a broad range of users to conduct standard analyses from RNA-seq data, even with little programming experience. Although existing solutions usually provide adequate procedures for normalization, exploration or differential expression, more advanced features, such as gene clustering or regulatory network inference, are often missing or do not reflect current state-of-the-art methodologies. Results We developed here a user interface called DIANE (Dashboard for the Inference and Analysis of Networks from Expression data) designed to harness the potential of multi-factorial expression datasets from any organism through a precise set of methods. The DIANE interactive workflow provides normalization, dimensionality reduction, differential expression and ontology enrichment. Gene clustering can be performed and explored via configurable Mixture Models, and Random Forests are used to infer gene regulatory networks. DIANE also includes a novel procedure to assess the statistical significance of regulator-target influence measures based on permutations for Random Forest importance metrics. All along the pipeline, session reports and results can be downloaded to ensure clear and reproducible analyses. Conclusions We demonstrate the value and the benefits of DIANE using a recently published data set describing the transcriptional response of Arabidopsis thaliana under the combination of temperature, drought and salinity perturbations. We show that DIANE can intuitively carry out informative exploration and statistical procedures with RNA-Seq data, perform model-based clustering of gene expression profiles and go further into gene network reconstruction, providing relevant candidate genes or signalling pathways to explore.
DIANE is available as a web service (https://diane.bpmp.inrae.fr), or can be installed and locally launched as a complete R package.
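
The general logic of a permutation-based significance test for a regulator-target influence measure can be sketched as below. This is not DIANE's implementation: absolute Pearson correlation stands in for the Random Forest importance metric, and the function names, data and number of permutations are hypothetical.

```python
import math
import random


def abs_pearson(x, y):
    """Absolute Pearson correlation, used here as a stand-in influence measure."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return abs(cov / (sx * sy))


def permutation_p_value(regulator, target, n_perm=999, seed=0):
    """Empirical p-value: how often a shuffled target matches or beats the observed score."""
    rng = random.Random(seed)
    observed = abs_pearson(regulator, target)
    shuffled = list(target)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real regulator-target association
        if abs_pearson(regulator, shuffled) >= observed:
            exceed += 1
    # Adding 1 to numerator and denominator keeps the estimate strictly positive.
    return (1 + exceed) / (1 + n_perm)


# Hypothetical expression of a regulator and a strongly associated target gene.
regulator = [0.1, 0.4, 0.9, 1.3, 1.8, 2.2, 2.9, 3.1, 3.6, 4.0]
target = [0.3, 0.9, 1.7, 2.8, 3.5, 4.6, 5.7, 6.1, 7.4, 8.1]
p_value = permutation_p_value(regulator, target)
```

Because many regulator-target pairs are tested in a network inference setting, p-values of this kind would normally be corrected for multiple testing before edges are retained.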


2003 ◽  
Vol 11 (01) ◽  
pp. 43-56 ◽  
Author(s):  
KRZYSZTOF FUJAREWICZ ◽  
MAREK KIMMEL ◽  
JOANNA RZESZOWSKA-WOLNY ◽  
ANDRZEJ SWIERNIAK

Microarrays provide a new technique for measuring gene expression that has attracted a lot of research interest in recent years. It has been suggested that gene expression data from microarrays (biochips) can be utilized in many biomedical areas, for example in cancer classification. Whereas several new and existing methods of classification have been tested, the selection of a proper (optimal) set of genes whose expression serves during classification is still an open problem. In this paper we propose a heuristic method of choosing a suboptimal set of genes by using support vector machines (SVM). The obtained set of genes optimizes the leave-one-out cross-validation error. The method is tested on microarray gene expression data of samples of two cancer types: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The results show that the quality of classification is much better than for sets obtained using other methods of feature selection. In addition, we demonstrate that maximum separation in a training data set may lead to deterioration of performance in an independent validation data set, a phenomenon akin to overfitting.
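
Using leave-one-out cross-validation error as the criterion for choosing genes can be sketched as a greedy forward search. This is a simplified illustration under stated assumptions, not the authors' heuristic: a nearest-centroid classifier replaces the SVM so that the example needs no external library, and the function names and toy data are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample, genes):
    """Predict the class whose centroid (over the chosen genes) is closest to the sample."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(r[g] for r in rows) / len(rows) for g in genes]
        dist = sum((sample[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_error(x, y, genes):
    """Leave-one-out error: each sample is predicted from all the others."""
    errors = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_x, train_y, x[i], genes) != y[i]:
            errors += 1
    return errors / len(x)


def greedy_select(x, y, max_genes):
    """Add one gene at a time while the leave-one-out error keeps decreasing."""
    selected, best_err = [], float("inf")
    remaining = list(range(len(x[0])))
    while remaining and len(selected) < max_genes:
        err, gene = min((loocv_error(x, y, selected + [g]), g) for g in remaining)
        if err >= best_err:
            break  # no remaining gene improves the criterion
        selected.append(gene)
        remaining.remove(gene)
        best_err = err
    return selected, best_err


# Toy data: six samples, three genes; only gene 1 separates the two classes.
x = [[1.0, 0.1, 3.0], [2.0, 0.2, 1.0], [1.5, 0.0, 2.0],
     [1.2, 5.0, 2.5], [1.8, 5.1, 1.5], [1.4, 4.9, 2.2]]
y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
selected, error = greedy_select(x, y, max_genes=2)
```

As the abstract cautions, a gene set tuned to minimize error on the training samples should still be checked on an independent validation set, since the selection step itself can overfit.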


2017 ◽  
Vol 20 (2) ◽  
Author(s):  
Jorge Parraga-Alava ◽  
Mario Inostroza-Ponta

Clustering algorithms are a common method for data analysis in many scientific fields. They have become popular among biologists because they make it easy to discover similar cellular functions in gene expression data. Most approaches treat gene clustering as an optimization problem, in which an ad hoc cluster quality index is optimized; the index can be defined in terms of gene expression data or biological information. However, these approaches may not be sufficient, since they cannot guarantee that the generated clusters have both similar expression patterns and biological coherence. In this paper, we propose a bi-objective clustering algorithm to discover clusters of genes with high levels of co-expression and biological coherence. Our approach uses a multi-objective evolutionary algorithm (MOEA) that optimizes two indices based on gene expression level and biological functional classes. The algorithm is tested on three real-life gene expression datasets. Results show that the proposed model yields gene clusters with higher levels of co-expression and biological coherence than traditional approaches.
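
The bi-objective view can be made concrete with the Pareto dominance relation that multi-objective methods rely on. The sketch below is not the proposed MOEA: it only filters a set of candidate clusterings, each scored by two indices to be maximized, down to the non-dominated ones. The scores and function names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


# Hypothetical (co-expression index, biological coherence index) pairs
# for five candidate clusterings; both indices are to be maximized.
candidates = [(0.9, 0.2), (0.7, 0.6), (0.4, 0.8), (0.5, 0.5), (0.3, 0.3)]
front = pareto_front(candidates)
```

Here the last two candidates are dominated by (0.7, 0.6), so the front contains the first three. Choosing a single clustering from the front remains a separate decision that depends on how the two criteria are weighed.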

