Interpretable per Case Weighted Ensemble Method for Cancer Associations

Mapping Intimacies ◽

10.1101/008185 ◽

2014 ◽

Author(s):

Adrin Jalali ◽

Nico Pfeifer

Keyword(s):

Gene Expression ◽

Dna Methylation ◽

Ribosomal Proteins ◽

Data Sets ◽

Data Set ◽

Cancer Data ◽

Combination Of Classifiers ◽

Cancer Types ◽

Or Gene ◽

Leukemia Data

Motivation: Molecular measurements from cancer patients such as gene expression and DNA methylation are usually very noisy. Furthermore, cancer types can be very heterogeneous. Therefore, one of the main assumptions for machine learning, that the underlying unknown distribution is the same for all samples, might not be completely fullfilled. We introduce a method, that can estimate this bias on a per-feature level and incorporate calculated feature confidences into a weighted combination of classifiers with disjoint feature sets. Results: The new method achieves state-of-the-art performance on many different cancer data sets with measured DNA methylation or gene expression. Moreover, we show how to visualize the learned classifiers to find interesting associations with the target label. Applied to a leukemia data set we find several ribosomal proteins associated with leukemia's risk group that might be interesting targets for follow-up studies and support the hypothesis that the ribosomes are a new frontier in gene regulation. Availability: The method is available under GPLv3+ License at https: //github.com/adrinjalali/Network-Classifier.

Download Full-text

Case-Based Retrieval Framework for Gene Expression Data

Cancer Informatics ◽

10.4137/cin.s22371 ◽

2015 ◽

Vol 14 ◽

pp. CIN.S22371 ◽

Cited By ~ 4

Author(s):

Ali Anaissi ◽

Madhu Goyal ◽

Daniel R. Catchpoole ◽

Ali Braytee ◽

Paul J. Kennedy

Keyword(s):

Gene Expression ◽

Prostate Cancer ◽

Gene Expression Data ◽

Childhood Leukemia ◽

Data Sets ◽

Expression Data ◽

Data Set ◽

Cancer Data ◽

Leukemia Data ◽

Case Based

Background The process of retrieving similar cases in a case-based reasoning system is considered a big challenge for gene expression data sets. The huge number of gene expression values generated by microarray technology leads to complex data sets and similarity measures for high-dimensional data are problematic. Hence, gene expression similarity measurements require numerous machine-learning and data-mining techniques, such as feature selection and dimensionality reduction, to be incorporated into the retrieval process. Methods This article proposes a case-based retrieval framework that uses a k-nearest-neighbor classifier with a weighted-feature-based similarity to retrieve previously treated patients based on their gene expression profiles. Results The herein-proposed methodology is validated on several data sets: a childhood leukemia data set collected from The Children's Hospital at Westmead, as well as the Colon cancer, the National Cancer Institute (NCI), and the Prostate cancer data sets. Results obtained by the proposed framework in retrieving patients of the data sets who are similar to new patients are as follows: 96% accuracy on the childhood leukemia data set, 95% on the NCI data set, 93% on the Colon cancer data set, and 98% on the Prostate cancer data set. Conclusion The designed case-based retrieval framework is an appropriate choice for retrieving previous patients who are similar to a new patient, on the basis of their gene expression data, for better diagnosis and treatment of childhood leukemia. Moreover, this framework can be applied to other gene expression data sets using some or all of its steps.

Download Full-text

Defining Immune Response Signatures in DLBCL As Potential Predictive Biomarkers for Outcome to Immunotherapy

Blood ◽

10.1182/blood.v126.23.2663.2663 ◽

2015 ◽

Vol 126 (23) ◽

pp. 2663-2663

Author(s):

Matthew A Care ◽

Stephen M Thirdborough ◽

Andrew J Davies ◽

Peter W.M. Johnson ◽

Andrew Jack ◽

...

Keyword(s):

Gene Expression ◽

Immune Response ◽

Network Analysis ◽

Gene Expression Data ◽

Research Funding ◽

Data Sets ◽

Expression Data ◽

Data Set ◽

Gene Correlation ◽

Cancer Types

Abstract Purpose To assess whether comparative gene network analysis can reveal characteristic immune response signatures that predict clinical response in Diffuse large B-cell lymphoma (DLBCL). Background The wealth of available gene expression data sets for DLBCL and other cancer types provides a resource to define recurrent pathological processes at the level of gene expression and gene correlation neighbourhoods. This is of particular relevance in the context of cancer immune responses, where convergence onto common patterns may drive shared gene expression profiles. Where existing and novel immunotherapies harness the immune response for therapeutic benefit such responses may provide predictive biomarkers. Methods We independently analysed publically available DLBCL gene expression data sets and a wide compendium of gene expression data from diverse cancer types, and then asked whether common elements of cancer host response could be identified from resulting networks. Using 10 DLBCL gene expression data sets, encompassing 2030 cases, we established pairwise gene correlation matrices per data set, which were merged to generate median correlations of gene pairs across all data sets. Gene network analysis and unsupervised clustering was then applied to define global representations of DLBCL gene expression neighbourhoods. In parallel a diverse range of solid and lymphoid malignancies including; breast, colorectal, oesophageal, head and neck, non-small cell lung, prostate, pancreatic cancer, Hodgkin lymphoma, Follicular lymphoma and DLBCL were independently analysed using an orthogonal weighted gene correlation network analysis of gene expression data sets from which correlated modules across diverse cancer types were identified. The biology of resulting gene neighbourhoods was assessed by signature and ontology enrichment, and the overlap between gene correlation neighbourhoods and WGCNA derived modules associated with immune/host responses was analysed. Results Amongst DLBCL data, we identified distinct gene correlation neighbourhoods associated with the immune response. These included both elements of IFN-polarised responses, core T-cell, and cytotoxic signatures as well as distinct macrophage responses. Neighbourhoods linked to macrophages separated CD163 from CD68 and CD14. In the WGCNA analysis of diverse cancer types clusters corresponding to these immune response neighbourhoods were independently identified including a highly similar cluster related to CD163. The overlapping CD163 clusters in both analyses linked to diverse Fc-Receptors, complement pathway components and patterns of scavenger receptors potentially linked to alternative macrophage activation. The relationship between the CD163 macrophage gene expression cluster and outcome was tested in DLBCL data sets, identifying a poor response in CD163 -cluster high patients, which reached statistical significance in one data set (GSE10846). Notably, the effect of the CD163-associated gene neighbourhood which correlates with poor outcome post rituximab containing immunochemotherapy is distinct from the effect of IFNG-STAT1-IRF1 polarised cytotoxic responses. The latter represents the predominant immune response pattern separating cell of origin unclassifiable (Type-III) DLBCL from either ABC or GCB DLBCL subsets, and is associated with a trend toward positive outcome. Conclusion Comparative gene expression network analysis identifies common immune response signatures shared between DLBCL and other cancer types. Gene expression clusters linked to CD163 macrophage responses and IFNG-STAT1-IRF1 polarised cytotoxic responses are common patterns with apparent divergent outcome association. Disclosures Davies: CTI: Honoraria; GIlead: Consultancy, Honoraria, Research Funding; Mundipharma: Honoraria, Research Funding; Bayer: Research Funding; Takeda: Honoraria, Research Funding; Janssen: Honoraria, Research Funding; Roche: Honoraria, Research Funding; GSK: Research Funding; Pfizer: Honoraria; Celgene: Honoraria, Research Funding. Jack:Jannsen: Research Funding.

Download Full-text

DNA Methylation Markers for Pan-Cancer Prediction by Deep Learning

Genes ◽

10.3390/genes10100778 ◽

2019 ◽

Vol 10 (10) ◽

pp. 778 ◽

Cited By ~ 6

Author(s):

Liu ◽

Pan ◽

Li ◽

Yang ◽

...

Keyword(s):

Dna Methylation ◽

Deep Learning ◽

Sensitivity And Specificity ◽

Test Data ◽

Data Sets ◽

Methylation Data ◽

Average Sensitivity ◽

Validation Data ◽

Data Set ◽

Cancer Types

For cancer diagnosis, many DNA methylation markers have been identified. However, few studies have tried to identify DNA methylation markers to diagnose diverse cancer types simultaneously, i.e., pan-cancers. In this study, we tried to identify DNA methylation markers to differentiate cancer samples from the respective normal samples in pan-cancers. We collected whole genome methylation data of 27 cancer types containing 10,140 cancer samples and 3386 normal samples, and divided all samples into five data sets, including one training data set, one validation data set and three test data sets. We applied machine learning to identify DNA methylation markers, and specifically, we constructed diagnostic prediction models by deep learning. We identified two categories of markers: 12 CpG markers and 13 promoter markers. Three of 12 CpG markers and four of 13 promoter markers locate at cancer-related genes. With the CpG markers, our model achieved an average sensitivity and specificity on test data sets as 92.8% and 90.1%, respectively. For promoter markers, the average sensitivity and specificity on test data sets were 89.8% and 81.1%, respectively. Furthermore, in cell-free DNA methylation data of 163 prostate cancer samples, the CpG markers achieved the sensitivity as 100%, and the promoter markers achieved 92%. For both marker types, the specificity of normal whole blood was 100%. To conclude, we identified methylation markers to diagnose pan-cancers, which might be applied to liquid biopsy of cancers.

Download Full-text

DEVELOPING OPTIMAL PREDICTION MODELS FOR CANCER CLASSIFICATION USING GENE EXPRESSION DATA

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720004000351 ◽

2004 ◽

Vol 01 (04) ◽

pp. 681-694 ◽

Cited By ~ 8

Author(s):

MAT SOUKUP ◽

JAE K. LEE

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Prediction Models ◽

Expression Patterns ◽

Training Sample ◽

Data Sets ◽

Expression Data ◽

Optimal Prediction ◽

Cancer Data ◽

Leukemia Data

Microarrays can provide genome-wide expression patterns for various cancers, especially for tumor sub-types that may exhibit substantially different patient prognosis. Using such gene expression data, several approaches have been proposed to classify tumor sub-types accurately. These classification methods are not robust, and often dependent on a particular training sample for modelling, which raises issues in utilizing these methods to administer proper treatment for a future patient. We propose to construct an optimal, robust prediction model for classifying cancer sub-types using gene expression data. Our model is constructed in a step-wise fashion implementing cross-validated quadratic discriminant analysis. At each step, all identified models are validated by an independent sample of patients to develop a robust model for future data. We apply the proposed methods to two microarray data sets of cancer: the acute leukemia data by Golub et al.3 and the colon cancer data by Alon et al.12 We have found that the dimensionality of our optimal prediction models is relatively small for these cases and that our prediction models with one or two gene factors outperforms or has competing performance, especially for independent samples, to other methods based on 50 or more predictive gene factors. The methodology is implemented and developed by the procedures in R and Splus. The source code can be obtained at .

Download Full-text

Investigation of the international comparability of population-based routine hospital data set derived comorbidity scores for patients with lung cancer

Thorax ◽

10.1136/thoraxjnl-2017-210362 ◽

2017 ◽

Vol 73 (4) ◽

pp. 339-349 ◽

Cited By ~ 4

Author(s):

Margreet Lüchtenborg ◽

Eva J A Morris ◽

Daniela Tataru ◽

Victoria H Coupland ◽

Andrew Smith ◽

...

Keyword(s):

Lung Cancer ◽

Cancer Survival ◽

Population Based ◽

Data Sets ◽

Hospital Data ◽

Data Set ◽

Discharge Data ◽

International Differences ◽

Cancer Data ◽

Comorbidity Scores

IntroductionThe International Cancer Benchmarking Partnership (ICBP) identified significant international differences in lung cancer survival. Differing levels of comorbid disease across ICBP countries has been suggested as a potential explanation of this variation but, to date, no studies have quantified its impact. This study investigated whether comparable, robust comorbidity scores can be derived from the different routine population-based cancer data sets available in the ICBP jurisdictions and, if so, use them to quantify international variation in comorbidity and determine its influence on outcome.MethodsLinked population-based lung cancer registry and hospital discharge data sets were acquired from nine ICBP jurisdictions in Australia, Canada, Norway and the UK providing a study population of 233 981 individuals. For each person in this cohort Charlson, Elixhauser and inpatient bed day Comorbidity Scores were derived relating to the 4–36 months prior to their lung cancer diagnosis. The scores were then compared to assess their validity and feasibility of use in international survival comparisons.ResultsIt was feasible to generate the three comorbidity scores for each jurisdiction, which were found to have good content, face and concurrent validity. Predictive validity was limited and there was evidence that the reliability was questionable.ConclusionThe results presented here indicate that interjurisdictional comparability of recorded comorbidity was limited due to probable differences in coding and hospital admission practices in each area. Before the contribution of comorbidity on international differences in cancer survival can be investigated an internationally harmonised comorbidity index is required.

Download Full-text

Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data

BMC Genomics ◽

10.1186/s12864-020-07038-3 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Da Xu ◽

Jialin Zhang ◽

Hanxiao Xu ◽

Yusen Zhang ◽

Wei Chen ◽

...

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Therapeutic Targets ◽

Genomic Data ◽

Data Sets ◽

Selection Algorithm ◽

Feature Selection Algorithm ◽

Model Learning ◽

Data Set ◽

Multi Scale

Abstract Background The small number of samples and the curse of dimensionality hamper the better application of deep learning techniques for disease classification. Additionally, the performance of clustering-based feature selection algorithms is still far from being satisfactory due to their limitation in using unsupervised learning methods. To enhance interpretability and overcome this problem, we developed a novel feature selection algorithm. In the meantime, complex genomic data brought great challenges for the identification of biomarkers and therapeutic targets. The current some feature selection methods have the problem of low sensitivity and specificity in this field. Results In this article, we designed a multi-scale clustering-based feature selection algorithm named MCBFS which simultaneously performs feature selection and model learning for genomic data analysis. The experimental results demonstrated that MCBFS is robust and effective by comparing it with seven benchmark and six state-of-the-art supervised methods on eight data sets. The visualization results and the statistical test showed that MCBFS can capture the informative genes and improve the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW using gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, network recognition ensemble algorithm and feature selection wrapper. McbfsNW has been applied to the lung adenocarcinoma (LUAD) data sets. The preliminary results demonstrated that higher prediction results can be attained by identified biomarkers on the independent LUAD data set, and we also structured a drug-target network which may be good for LUAD therapy. Conclusions The proposed novel feature selection method is robust and effective for gene selection, classification, and visualization. The framework McbfsNW is practical and helpful for the identification of biomarkers and targets on genomic data. It is believed that the same methods and principles are extensible and applicable to other different kinds of data sets.

Download Full-text

The effects of a globin blocker on the resolution of 3’mRNA sequencing data in porcine blood

BMC Genomics ◽

10.1186/s12864-019-6122-2 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 1

Author(s):

Kyu-Sang Lim ◽

Qian Dong ◽

Pamela Moll ◽

Jana Vitkovska ◽

Gregor Wiktorin ◽

...

Keyword(s):

Gene Expression ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Data Sets ◽

Globin Genes ◽

Sequencing Data ◽

Globin Mrna ◽

Data Set ◽

Mrna Sequencing ◽

Porcine Blood

Abstract Background Gene expression profiling in blood is a potential source of biomarkers to evaluate or predict phenotypic differences between pigs but is expensive and inefficient because of the high abundance of globin mRNA in porcine blood. These limitations can be overcome by the use of QuantSeq 3’mRNA sequencing (QuantSeq) combined with a method to deplete or block the processing of globin mRNA prior to or during library construction. Here, we validated the effectiveness of QuantSeq using a novel specific globin blocker (GB) that is included in the library preparation step of QuantSeq. Results In data set 1, four concentrations of the GB were applied to RNA samples from two pigs. The GB significantly reduced the proportion of globin reads compared to non-GB (NGB) samples (P = 0.005) and increased the number of detectable non-globin genes. The highest evaluated concentration (C1) of the GB resulted in the largest reduction of globin reads compared to the NGB (from 56.4 to 10.1%). The second highest concentration C2, which showed very similar globin depletion rates (12%) as C1 but a better correlation of the expression of non-globin genes between NGB and GB (r = 0.98), allowed the expression of an additional 1295 non-globin genes to be detected, although 40 genes that were detected in the NGB sample (at a low level) were not present in the GB library. Concentration C2 was applied in the rest of the study. In data set 2, the distribution of the percentage of globin reads for NGB (n = 184) and GB (n = 189) samples clearly showed the effects of the GB on reducing globin reads, in particular for HBB, similar to results from data set 1. Data set 3 (n = 84) revealed that the proportion of globin reads that remained in GB samples was significantly and positively correlated with the reticulocyte count in the original blood sample (P < 0.001). Conclusions The effect of the GB on reducing the proportion of globin reads in porcine blood QuantSeq was demonstrated in three data sets. In addition to increasing the efficiency of sequencing non-globin mRNA, the GB for QuantSeq has an advantage that it does not require an additional step prior to or during library creation. Therefore, the GB is a useful tool in the quantification of whole gene expression profiles in porcine blood.

Download Full-text

SPECTRAL CLUSTERING ON GENE EXPRESSION PROFILE TO IDENTIFY CANCER TYPES OR SUBTYPES

Jurnal Teknologi ◽

10.11113/jt.v76.4036 ◽

2015 ◽

Vol 76 (1) ◽

Author(s):

Ang Jun Chin ◽

Andri Mirzal ◽

Habibollah Haron

Keyword(s):

Gene Expression ◽

Gene Expression Profile ◽

Expression Profile ◽

Microarray Data ◽

Spectral Clustering ◽

Data Sets ◽

Clustering Methods ◽

Microarray Gene Expression ◽

Cancer Types ◽

Microarray Gene

Gene expression profile is eminent for its broad applications and achievements in disease discovery and analysis, especially in cancer research. Spectral clustering is robust to irrelevant features which are appropriated for gene expression analysis. However, previous works show that performance comparison with other clustering methods is limited and only a few microarray data sets were analyzed in each study. In this study, we demonstrate the use of spectral clustering in identifying cancer types or subtypes from microarray gene expression profiling. Spectral clustering was applied to eleven microarray data sets and its clustering performances were compared with the results in the literature. Based on the result, overall the spectral clustering slightly outperformed the corresponding results in the literature. The spectral clustering can also offer more stable clustering performances as it has smaller standard deviation value. Moreover, out of eleven data sets the spectral clustering outperformed the corresponding methods in the literature for six data sets. So, it can be stated that the spectral clustering is a promising method in identifying the cancer types or subtypes for microarray gene expression data sets.

Download Full-text

Integrative Analysis Reveals Comprehensive Altered Metabolic Genes Linking with Tumor Epigenetics Modification in Pan-Cancer

BioMed Research International ◽

10.1155/2019/6706354 ◽

2019 ◽

Vol 2019 ◽

pp. 1-17 ◽

Cited By ~ 1

Author(s):

Yahui Shi ◽

Jinfen Wei ◽

Zixi Chen ◽

Yuchen Yuan ◽

Xingsong Li ◽

...

Keyword(s):

Gene Expression ◽

Dna Methylation ◽

Histone Acetylation ◽

Epigenetic Modification ◽

Metabolic Reprogramming ◽

The Cancer Genome Atlas ◽

Altered Expression ◽

Metabolic Genes ◽

Cancer Types ◽

Pan Cancer

Background. Cancer cells undergo various rewiring of metabolism and dysfunction of epigenetic modification to support their biosynthetic needs. Although the major features of metabolic reprogramming have been elucidated, the global metabolic genes linking epigenetics were overlooked in pan-cancer. Objectives. Identifying the critical metabolic signatures with differential expressions which contributes to the epigenetic alternations across cancer types is an urgent issue for providing the potential targets for cancer therapy. Method. The differential gene expression and DNA methylation were analyzed by using the 5726 samples data from the Cancer Genome Atlas (TCGA). Results. Firstly, we analyzed the differential expression of metabolic genes and found that cancer underwent overall metabolism reprogramming, which exhibited a similar expression trend with the data from the Gene Expression Omnibus (GEO) database. Secondly, the regulatory network of histone acetylation and DNA methylation according to altered expression of metabolism genes was summarized in our results. Then, the survival analysis showed that high expression of DNMT3B had a poorer overall survival in 5 cancer types. Integrative altered methylation and expression revealed specific genes influenced by DNMT3B through DNA methylation across cancers. These genes do not overlap across various cancer types and are involved in different function annotations depending on the tissues, which indicated DNMT3B might influence DNA methylation in tissue specificity. Conclusions. Our research clarifies some key metabolic genes, ACLY, SLC2A1, KAT2A, and DNMT3B, which are most disordered and indirectly contribute to the dysfunction of histone acetylation and DNA methylation in cancer. We also found some potential genes in different cancer types influenced by DNMT3B. Our study highlights possible epigenetic disorders resulting from the deregulation of metabolic genes in pan-cancer and provides potential therapy in the clinical treatment of human cancer.

Download Full-text

CoMM: A Collaborative Mixed Model That Integrates GWAS and eQTL Data Sets to Investigate the Genetic Architecture of Complex Traits

Bioinformatics and Biology Insights ◽

10.1177/1177932219881435 ◽

2019 ◽

Vol 13 ◽

pp. 117793221988143 ◽

Cited By ~ 1

Author(s):

Kar-Fu Yeung ◽

Yi Yang ◽

Can Yang ◽

Jin Liu

Keyword(s):

Gene Expression ◽

Association Study ◽

Genetic Variants ◽

Complex Traits ◽

Mixed Model ◽

Genome Wide Association Study ◽

Data Sets ◽

Transcriptome Data ◽

Data Set ◽

Expression Levels

Genome-wide association study (GWAS) analyses have identified thousands of associations between genetic variants and complex traits. However, it is still a challenge to uncover the mechanisms underlying the association. With the growing availability of transcriptome data sets, it has become possible to perform statistical analyses targeted at identifying influential genes whose expression levels correlate with the phenotype. Methods such as PrediXcan and transcriptome-wide association study (TWAS) use the transcriptome data set to fit a predictive model for gene expression, with genetic variants as covariates. The gene expression levels for the GWAS data set are then ‘imputed’ using the prediction model, and the imputed expression levels are tested for their association with the phenotype. These methods fail to account for the uncertainty in the GWAS imputation step, and we propose a collaborative mixed model (CoMM) that addresses this limitation by jointly modelling the multiple analysis steps. We illustrate CoMM’s ability to identify relevant genes in the Northern Finland Birth Cohort 1966 data set and extend the model to handle the more widely available GWAS summary statistics.

Download Full-text