Prediction of Missing Values in Microarray and Use of Mixed Models to Evaluate the Predictors

Author(s):  
Guri Feten ◽  
Trygve Almøy ◽  
Are H. Aastveit

Gene expression microarray experiments generate data sets with multiple missing expression values. In some cases, analysis of gene expression requires a complete matrix as input. Either genes with missing values can be removed, or the missing values can be replaced using prediction. We propose six imputation methods. A comparative study of the methods was performed on data from mice and data from the bacterium Enterococcus faecalis, and a linear mixed model was used to test for differences between the methods. The study showed that the predictive capability of each method depends on the data; hence the ideal choice of method and number of components differs for each data set. For data with correlation structure, methods based on K-nearest neighbours seemed to be best, while for data without correlation structure, using the average of the gene was preferable.
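The two families of predictors contrasted in this abstract (the gene average and K-nearest neighbours) can be illustrated with a minimal sketch. The abstract does not specify the six proposed methods, so the distance measure, the neighbour rule, the fallback, and the function names `impute_gene_average` and `impute_knn` below are all assumptions made for illustration, not the authors' implementation. Rows are genes, columns are arrays, and missing values are NaN.

```python
import numpy as np

def impute_gene_average(X):
    """Replace each NaN with the mean of the observed values in the same row (gene)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    row_means = np.nanmean(X, axis=1)
    rows, cols = np.where(np.isnan(X))
    filled[rows, cols] = row_means[rows]
    return filled

def impute_knn(X, k=2):
    """Replace each NaN with the mean of that column over the k nearest genes.

    Assumed distance: root mean squared difference over the columns observed
    in both genes. Genes missing the target column cannot act as neighbours.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    observed = ~np.isnan(X)
    for i, j in zip(*np.where(~observed)):
        candidates = []
        for g in range(X.shape[0]):
            if g == i or not observed[g, j]:
                continue
            shared = observed[i] & observed[g]
            if not shared.any():
                continue
            dist = np.sqrt(np.mean((X[i, shared] - X[g, shared]) ** 2))
            candidates.append((dist, X[g, j]))
        if candidates:
            candidates.sort(key=lambda pair: pair[0])
            filled[i, j] = np.mean([value for _, value in candidates[:k]])
        else:
            # Assumed fallback: no usable neighbour, so use the gene average.
            filled[i, j] = np.nanmean(X[i])
    return filled

# Toy matrix: gene 0 is strongly correlated with genes 1 and 2, not with gene 3.
expr = np.array([
    [1.0, 2.0, 3.0, np.nan],
    [1.1, 2.1, 3.1, 4.1],
    [0.9, 1.9, 2.9, 3.9],
    [5.0, 1.0, 0.0, 9.0],
])
print(impute_gene_average(expr)[0, 3], impute_knn(expr, k=2)[0, 3])
```

On the toy matrix the neighbour-based estimate borrows from the two correlated genes, while the gene average ignores them. This is one way to see why the better choice could depend on whether the data have correlation structure.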

2019 ◽  
Vol 13 ◽  
pp. 117793221988143 ◽  
Author(s):  
Kar-Fu Yeung ◽  
Yi Yang ◽  
Can Yang ◽  
Jin Liu

Genome-wide association study (GWAS) analyses have identified thousands of associations between genetic variants and complex traits. However, it is still a challenge to uncover the mechanisms underlying the association. With the growing availability of transcriptome data sets, it has become possible to perform statistical analyses targeted at identifying influential genes whose expression levels correlate with the phenotype. Methods such as PrediXcan and transcriptome-wide association study (TWAS) use the transcriptome data set to fit a predictive model for gene expression, with genetic variants as covariates. The gene expression levels for the GWAS data set are then ‘imputed’ using the prediction model, and the imputed expression levels are tested for their association with the phenotype. These methods fail to account for the uncertainty in the GWAS imputation step, and we propose a collaborative mixed model (CoMM) that addresses this limitation by jointly modelling the multiple analysis steps. We illustrate CoMM’s ability to identify relevant genes in the Northern Finland Birth Cohort 1966 data set and extend the model to handle the more widely available GWAS summary statistics.
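The two-stage procedure that this abstract attributes to PrediXcan and TWAS (fit an expression model on the transcriptome data set, impute expression in the GWAS data set, then test the imputed values against the phenotype) can be sketched as follows. This is a simplified illustration under stated assumptions, not CoMM and not the code of any of those tools: ridge regression stands in for the prediction model, the data are simulated, and every name and parameter value is hypothetical. Because the second stage treats the imputed expression as if it were observed, the sketch shares the limitation the abstract describes.

```python
import numpy as np

def fit_ridge(genotypes, expression, penalty=1.0):
    """Stage 1: ridge weights predicting expression from centred genotypes."""
    G = genotypes - genotypes.mean(axis=0)
    y = expression - expression.mean()
    return np.linalg.solve(G.T @ G + penalty * np.eye(G.shape[1]), G.T @ y)

def impute_expression(genotypes, weights):
    """Stage 2a: predicted (centred) expression for the GWAS individuals."""
    return (genotypes - genotypes.mean(axis=0)) @ weights

def association_t_statistic(predictor, phenotype):
    """Stage 2b: slope and t statistic from a simple linear regression."""
    x = predictor - predictor.mean()
    y = phenotype - phenotype.mean()
    slope = (x @ y) / (x @ x)
    residuals = y - slope * x
    standard_error = np.sqrt((residuals @ residuals) / (len(y) - 2) / (x @ x))
    return slope, slope / standard_error

# Simulated data; all sizes and effect values are arbitrary assumptions.
rng = np.random.default_rng(0)
true_weights = np.array([0.5, -0.3, 0.2, 0.0, 0.4])
ref_geno = rng.binomial(2, 0.3, size=(300, 5)).astype(float)
ref_expr = ref_geno @ true_weights + rng.normal(0, 0.5, size=300)
gwas_geno = rng.binomial(2, 0.3, size=(2000, 5)).astype(float)
gwas_pheno = gwas_geno @ true_weights + rng.normal(0, 1.0, size=2000)

weights = fit_ridge(ref_geno, ref_expr)
imputed = impute_expression(gwas_geno, weights)
slope, t_stat = association_t_statistic(imputed, gwas_pheno)
print(slope, t_stat)
```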


2021 ◽  
pp. 1-29
Author(s):  
Nicole Sanford ◽  
Todd S. Woodward

Abstract Background: Working memory (WM) impairment in schizophrenia substantially impacts functional outcome. Although the dorsolateral pFC has been implicated in such impairment, a more comprehensive examination of brain networks comprising pFC is warranted. The present research used a whole-brain, multi-experiment analysis to delineate task-related networks comprising pFC. Activity was examined in schizophrenia patients across a variety of cognitive demands. Methods: One hundred schizophrenia patients and 102 healthy controls completed one of four fMRI tasks: a Sternberg verbal WM task, a visuospatial WM task, a Stroop set-switching task, and a thought generation task (TGT). Task-related networks were identified using multi-experiment constrained PCA for fMRI. Effects of task conditions and group differences were examined using mixed-model ANOVA on the task-related time series. Correlations between task performance and network engagement were also performed. Results: Four spatially and temporally distinct networks with pFC activation emerged and were postulated to subserve (1) internal attention, (2) auditory–motor attention, (3) motor responses, and (4) task energizing. The “energizing” network—engaged during WM encoding and diminished in patients—exhibited consistent trend relationships with WM capacity across different data sets. The dorsolateral-prefrontal-cortex-dominated “internal attention” network exhibited some evidence of hypoactivity in patients, but was not correlated with WM performance. Conclusions: Multi-experiment analysis allowed delineation of task-related, pFC-anchored networks across different cognitive constructs. Given the results with respect to the early-responding “energizing” network, WM deficits in schizophrenia may arise from disruption in the “energization” process described by Donald Stuss' model of pFC functions.


BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Da Xu ◽  
Jialin Zhang ◽  
Hanxiao Xu ◽  
Yusen Zhang ◽  
Wei Chen ◽  
...  

Abstract Background The small number of samples and the curse of dimensionality hamper the application of deep learning techniques to disease classification. Additionally, the performance of clustering-based feature selection algorithms is still far from satisfactory because they are limited to unsupervised learning methods. To enhance interpretability and overcome this problem, we developed a novel feature selection algorithm. Meanwhile, complex genomic data pose great challenges for the identification of biomarkers and therapeutic targets. Some current feature selection methods suffer from low sensitivity and specificity in this field. Results In this article, we designed a multi-scale clustering-based feature selection algorithm named MCBFS, which simultaneously performs feature selection and model learning for genomic data analysis. The experimental results demonstrated that MCBFS is robust and effective by comparing it with seven benchmark and six state-of-the-art supervised methods on eight data sets. The visualization results and the statistical test showed that MCBFS can capture the informative genes and improve the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW using gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, network recognition ensemble algorithm and feature selection wrapper. McbfsNW has been applied to the lung adenocarcinoma (LUAD) data sets. The preliminary results demonstrated that higher prediction performance can be attained with the identified biomarkers on the independent LUAD data set, and we also constructed a drug-target network which may be useful for LUAD therapy.
Conclusions The proposed novel feature selection method is robust and effective for gene selection, classification, and visualization. The framework McbfsNW is practical and helpful for the identification of biomarkers and targets on genomic data. It is believed that the same methods and principles are extensible and applicable to other different kinds of data sets.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Kyu-Sang Lim ◽  
Qian Dong ◽  
Pamela Moll ◽  
Jana Vitkovska ◽  
Gregor Wiktorin ◽  
...  

Abstract Background Gene expression profiling in blood is a potential source of biomarkers to evaluate or predict phenotypic differences between pigs but is expensive and inefficient because of the high abundance of globin mRNA in porcine blood. These limitations can be overcome by the use of QuantSeq 3’mRNA sequencing (QuantSeq) combined with a method to deplete or block the processing of globin mRNA prior to or during library construction. Here, we validated the effectiveness of QuantSeq using a novel specific globin blocker (GB) that is included in the library preparation step of QuantSeq. Results In data set 1, four concentrations of the GB were applied to RNA samples from two pigs. The GB significantly reduced the proportion of globin reads compared to non-GB (NGB) samples (P = 0.005) and increased the number of detectable non-globin genes. The highest evaluated concentration (C1) of the GB resulted in the largest reduction of globin reads compared to the NGB (from 56.4 to 10.1%). The second highest concentration C2, which showed very similar globin depletion rates (12%) as C1 but a better correlation of the expression of non-globin genes between NGB and GB (r = 0.98), allowed the expression of an additional 1295 non-globin genes to be detected, although 40 genes that were detected in the NGB sample (at a low level) were not present in the GB library. Concentration C2 was applied in the rest of the study. In data set 2, the distribution of the percentage of globin reads for NGB (n = 184) and GB (n = 189) samples clearly showed the effects of the GB on reducing globin reads, in particular for HBB, similar to results from data set 1. Data set 3 (n = 84) revealed that the proportion of globin reads that remained in GB samples was significantly and positively correlated with the reticulocyte count in the original blood sample (P < 0.001). 
Conclusions The effect of the GB on reducing the proportion of globin reads in porcine blood QuantSeq was demonstrated in three data sets. In addition to increasing the efficiency of sequencing non-globin mRNA, the GB for QuantSeq has an advantage that it does not require an additional step prior to or during library creation. Therefore, the GB is a useful tool in the quantification of whole gene expression profiles in porcine blood.


2006 ◽  
Vol 18 (2) ◽  
pp. 239
Author(s):  
J. Piedrahita ◽  
S. Bischoff ◽  
J. Estrada ◽  
B. Freking ◽  
D. Nonneman ◽  
...  

Genomic imprinting arises from differential epigenetic markings including DNA methylation and histone modifications and results in one allele being expressed in a parent-of-origin specific manner. For further insight into the porcine epigenome, gene expression profiles of parthenogenetic (PRT; two maternally derived chromosome sets) and biparental embryos (BP; one maternal and one paternal set of chromosomes) were compared using microarrays. Comparison of the expression profiles of the two tissue types permits identification of both maternally and paternally imprinted genes and thus the degree of conservation of imprinted genes between swine and other mammalian species. Diploid porcine parthenogenetic fetuses were generated using follicular oocytes (BOMED, Madison, WI, USA). Oocytes with a visible polar body were activated using a single square pulse of direct current of 50 V/mm for 100 µs and diploidized by culture in 10 µg/mL cycloheximide for 6 h to limit extrusion of the second polar body. Following culture, BP embryos obtained by natural matings, and PRT embryos, were surgically transferred to oviducts on the first day of estrus. Fetuses recovered at 28-30 days of gestation were dissected to separate viscera including brain, liver, and placenta; the visceral tissues were then flash-frozen in liquid nitrogen. Porcine fibroblast tissue was obtained from the remaining carcass by mincing, trypsinization, and plating cells in α-MEM. Total RNA was extracted from frozen tissue or cell culture using RNA Aqueous kit (Ambion, Austin, TX, USA) according to the manufacturer's protocol. Gene expression differences between BP and PRT tissues were determined using the GeneChip® Porcine Genome Array (Affymetrix, Santa Clara, CA) containing 23 256 transcripts from Sus scrofa and representing 42 genes known to be imprinted in human and/or mice. Triplicate arrays were utilized for each tissue type, and for PRT versus BP combination.
Significant differential gene expression was identified by a linear mixed model analysis using SAS 5.0 (SAS Institute, Cary, NC, USA). Storey's q-value method was used to correct for multiple testing at q ≤ 0.05. The following genes were classified as imprinted on the basis of their expression profiles: In fibroblasts, ARHI, HTR2A, MEST, NDN, NNAT, PEG3, PLAGL1, PEG10, SGCE, SNRPN, and UBE3A; in liver, IGF2, PEG3, PLAGL1, PEG10, and SNRPN; in placenta, HTR2A, IGF2, MEST, NDN, NNAT, PEG3, PLAGL1, PEG10, and SNRPN; and in brain, none. Additionally, several genes not known to be imprinted in humans/mice were highly differentially expressed between the two tissue types. Overall, utilizing the PRT models and gene expression profiles, we have identified thirteen genes where imprinting is conserved between swine and humans/mice, and several candidate genes that represent potentially imprinted genes. Presently, our efforts are focused on the identification of single nucleotide polymorphisms (SNPs) to more carefully evaluate the behavior of these genes in normal and abnormal gestations and to test whether the candidate genes are indeed imprinted. This research was supported by USDA-CSREES grant 524383 to J. P. and B. F.


Blood ◽  
2015 ◽  
Vol 126 (23) ◽  
pp. 2663-2663
Author(s):  
Matthew A Care ◽  
Stephen M Thirdborough ◽  
Andrew J Davies ◽  
Peter W.M. Johnson ◽  
Andrew Jack ◽  
...  

Abstract Purpose To assess whether comparative gene network analysis can reveal characteristic immune response signatures that predict clinical response in diffuse large B-cell lymphoma (DLBCL). Background The wealth of available gene expression data sets for DLBCL and other cancer types provides a resource to define recurrent pathological processes at the level of gene expression and gene correlation neighbourhoods. This is of particular relevance in the context of cancer immune responses, where convergence onto common patterns may drive shared gene expression profiles. Where existing and novel immunotherapies harness the immune response for therapeutic benefit, such responses may provide predictive biomarkers. Methods We independently analysed publicly available DLBCL gene expression data sets and a wide compendium of gene expression data from diverse cancer types, and then asked whether common elements of cancer host response could be identified from resulting networks. Using 10 DLBCL gene expression data sets, encompassing 2030 cases, we established pairwise gene correlation matrices per data set, which were merged to generate median correlations of gene pairs across all data sets. Gene network analysis and unsupervised clustering were then applied to define global representations of DLBCL gene expression neighbourhoods. In parallel, a diverse range of solid and lymphoid malignancies including breast, colorectal, oesophageal, head and neck, non-small cell lung, prostate, pancreatic cancer, Hodgkin lymphoma, follicular lymphoma and DLBCL were independently analysed using an orthogonal weighted gene correlation network analysis (WGCNA) of gene expression data sets from which correlated modules across diverse cancer types were identified. The biology of resulting gene neighbourhoods was assessed by signature and ontology enrichment, and the overlap between gene correlation neighbourhoods and WGCNA-derived modules associated with immune/host responses was analysed.
Results Amongst DLBCL data, we identified distinct gene correlation neighbourhoods associated with the immune response. These included both elements of IFN-polarised responses, core T-cell, and cytotoxic signatures as well as distinct macrophage responses. Neighbourhoods linked to macrophages separated CD163 from CD68 and CD14. In the WGCNA analysis of diverse cancer types, clusters corresponding to these immune response neighbourhoods were independently identified, including a highly similar cluster related to CD163. The overlapping CD163 clusters in both analyses linked to diverse Fc-receptors, complement pathway components and patterns of scavenger receptors potentially linked to alternative macrophage activation. The relationship between the CD163 macrophage gene expression cluster and outcome was tested in DLBCL data sets, identifying a poor response in CD163-cluster-high patients, which reached statistical significance in one data set (GSE10846). Notably, the effect of the CD163-associated gene neighbourhood, which correlates with poor outcome post rituximab-containing immunochemotherapy, is distinct from the effect of IFNG-STAT1-IRF1 polarised cytotoxic responses. The latter represents the predominant immune response pattern separating cell of origin unclassifiable (Type-III) DLBCL from either ABC or GCB DLBCL subsets, and is associated with a trend toward positive outcome. Conclusion Comparative gene expression network analysis identifies common immune response signatures shared between DLBCL and other cancer types. Gene expression clusters linked to CD163 macrophage responses and IFNG-STAT1-IRF1 polarised cytotoxic responses are common patterns with apparently divergent outcome associations.
Disclosures Davies: CTI: Honoraria; Gilead: Consultancy, Honoraria, Research Funding; Mundipharma: Honoraria, Research Funding; Bayer: Research Funding; Takeda: Honoraria, Research Funding; Janssen: Honoraria, Research Funding; Roche: Honoraria, Research Funding; GSK: Research Funding; Pfizer: Honoraria; Celgene: Honoraria, Research Funding. Jack: Janssen: Research Funding.
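The merging step in the Methods of this abstract (one pairwise gene correlation matrix per data set, combined into a median correlation per gene pair) can be sketched as below. The abstract does not say how genes were matched across platforms or which correlation measure was used, so this sketch assumes Pearson correlation and that every data set lists the same genes in the same row order; the function name `median_gene_correlation` is hypothetical.

```python
import numpy as np

def median_gene_correlation(datasets):
    """Median, across data sets, of the pairwise gene-gene correlation.

    Each element of `datasets` is an array shaped (genes, samples). The number
    of samples may differ between data sets, but the gene order is assumed
    to be identical.
    """
    per_dataset = [np.corrcoef(np.asarray(d, dtype=float)) for d in datasets]
    return np.median(np.stack(per_dataset), axis=0)

# Three tiny data sets with two genes each; sample counts differ on purpose.
toy = [
    np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]),            # correlation +1
    np.array([[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]),  # correlation +1
    np.array([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]),            # correlation -1
]
merged = median_gene_correlation(toy)
print(merged)
```

Taking the median rather than the mean keeps a single discordant data set from dominating the merged value, which is a plausible motivation for the design, although the abstract does not state it.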


2016 ◽  
Vol 25 (3) ◽  
pp. 431-440 ◽  
Author(s):  
Archana Purwar ◽  
Sandeep Kumar Singh

Abstract Ensuring data quality is an important task in data mining. The validity of mining algorithms is reduced if the data are not of good quality. The quality of data can be assessed in terms of missing values (MV) as well as noise present in the data set. Various imputation techniques have been studied for MVs, but little attention has been given to noise in earlier work. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel technique, density-based imputation (DBSCANI), built on density-based clustering to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel groups objects according to their density in spatial databases. The high-density regions are known as clusters, and the low-density regions refer to the noise objects in the data set. Extensive experiments were performed on the Iris data set from the life science domain and Jain’s (2D) data set from the shape data sets. The performance of the proposed method is evaluated using root mean square error (RMSE) and compared with existing K-means imputation (KMI). Results show that our method is more noise resistant than KMI on the data sets used in this study.
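The RMSE criterion named in this abstract can be sketched as follows. The abstract does not state which entries enter the error, so this sketch assumes the common setup in which known values are masked artificially and the error is computed only over those masked entries; the function name `rmse_on_missing` and the toy values are hypothetical.

```python
import numpy as np

def rmse_on_missing(true_values, imputed_values, missing_mask):
    """Root mean square error restricted to the entries that were masked out."""
    true_values = np.asarray(true_values, dtype=float)
    imputed_values = np.asarray(imputed_values, dtype=float)
    mask = np.asarray(missing_mask, dtype=bool)
    errors = imputed_values[mask] - true_values[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy example: two entries were hidden and then imputed by some method.
truth = np.array([[1.0, 2.0], [3.0, 4.0]])
imputed = np.array([[1.0, 5.0], [3.0, 0.0]])
hidden = np.array([[False, True], [False, True]])
score = rmse_on_missing(truth, imputed, hidden)
print(score)
```

Two imputation methods can then be compared by computing this score for each on the same mask, with the lower value indicating the closer reconstruction.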


2017 ◽  
Vol 42 (3) ◽  
pp. 325-343 ◽  
Author(s):  
Carol S. Walther ◽  
Dennis F. Corbin

Attitudes toward marriage equality have gradually become more accepting as more and more states have passed legislation that acknowledged full or partial recognition of marriage equality. Given the traditionally conservative behavior of the South, this article analyzes how regional migration patterns and time affect attitudes toward marriage equality from the 1988 and the 2004 to 2014 General Social Survey data sets using a generalized linear mixed model. We find that migrant southerners, migrant northerners, and native northerners are more likely to support marriage equality than native southerners are. Furthermore, time seems to also play a significant role in understanding trends in attitudes toward marriage equality. We conclude by suggesting future research.


Blood ◽  
2018 ◽  
Vol 132 (Supplement 1) ◽  
pp. 2763-2763 ◽  
Author(s):  
Brian S. White ◽  
Suleiman A. Khan ◽  
Muhammad Ammad-ud-din ◽  
Swapnil Potdar ◽  
Mike J Mason ◽  
...  

Abstract Introduction: Therapeutic options for patients with AML were recently expanded with FDA approval of four drugs in 2017. As their efficacy is limited in some patient subpopulations and relapse ultimately ensues, there remains an urgent need for additional treatment options tailored to well-defined patient subpopulations to achieve durable responses. Two comprehensive profiling efforts were launched to address this need: the multi-center Beat AML initiative, led by the Oregon Health & Science University (OHSU), and the AML Individualized Systems Medicine program at the Institute for Molecular Medicine Finland (FIMM). Methods: We performed a comparative analysis of the two large-scale data sets in which patient samples were subjected to whole-exome sequencing, RNA-seq, and ex vivo functional drug sensitivity screens: OHSU (121 patients and 160 drugs) and FIMM (39 patients and 480 drugs). We predicted ex vivo drug response [quantified as area under the dose-response curve (AUC)] using gene expression signatures selected with standard regression and a novel Bayesian model designed to analyze multiple data sets simultaneously. We restricted analysis to the 95 drugs in common between the two data sets. Results: The ex vivo responses (AUCs) of most drugs were positively correlated (OHSU: median Pearson correlation r across all pairwise drug comparisons=0.27; FIMM: median r=0.33). Consistently, a sample's ex vivo response to an individual drug was often correlated with the patient's Average ex vivo Drug Sensitivity (ADS), i.e., the average response across the 95 drugs (OHSU: median r across 95 drugs=0.41; FIMM: median r=0.58). Patients with a complete response to standard induction therapy had a higher ADS than those that were refractory (p=0.01). Further, patients whose ADS was in the top quartile had improved overall survival relative to those having an ADS in the bottom quartile (p<0.05).
Standard regression models (LASSO and Ridge) trained on ADS and gene expression in the OHSU data set had improved ex vivo response prediction performance as assessed in the independent FIMM validation data set relative to those trained on gene expression alone (LASSO: p=2.9×10⁻⁴; Ridge: p=4.4×10⁻³). Overall, ex vivo drug response was relatively well predicted (LASSO: mean r across 95 drugs=0.62; Ridge: mean r=0.62). The BCL-2 inhibitor venetoclax was the only drug whose response was negatively correlated with ADS in both data sets. We hypothesized that, whereas the predictive performance of many other drugs was likely dependent on ADS, the predictive performance of venetoclax (LASSO: r=0.53, p=0.01; Ridge: r=0.63, p=1.3×10⁻³) reflected specific gene expression biomarkers. To identify biomarkers associated with venetoclax sensitivity, we developed an integrative Bayesian machine learning method that jointly modeled both data sets, revealing several candidate biomarkers positively (BCL2 and FLT3) or negatively (CD14, MAFB, and LRP1) correlated with venetoclax response. We assessed these biomarkers in an independent data set that profiled ex vivo response to the BCL-2/BCL-XL inhibitor navitoclax in 29 AML patients (Lee et al.). All five biomarkers were validated in the Lee data set (Fig 1). Conclusions: The two independent ex vivo functional screens were highly concordant, demonstrating the reproducibility of the assays and the opportunity for their use in the clinic. Joint analysis of the two data sets robustly identified biomarkers of drug response for BCL-2 inhibitors. Two of these biomarkers, BCL2 and the previously-reported CD14, serve as positive controls credentialing our approach. CD14, MAFB, and LRP1 are involved in monocyte differentiation. The inverse correlation of their expression with venetoclax and navitoclax response is consistent with prior reports showing that monocytic cells are resistant to BCL-2 inhibition (Kuusanmäki et al.).
These biomarker panels may enable better selection of patient populations likely to respond to BCL-2 inhibition than would any one biomarker in isolation. References: Kuusanmäki et al. (2017) Single-Cell Drug Profiling Reveals Maturation Stage-Dependent Drug Responses in AML, Blood 130:3821 Lee et al. (2018) A machine learning approach to integrate big data for precision medicine in acute myeloid leukemia, Nat Commun 9:42 Disclosures Druker: Cepheid: Consultancy, Membership on an entity's Board of Directors or advisory committees; ALLCRON: Consultancy, Membership on an entity's Board of Directors or advisory committees; Fred Hutchinson Cancer Research Center: Research Funding; Celgene: Consultancy; Vivid Biosciences: Membership on an entity's Board of Directors or advisory committees; Aileron Therapeutics: Consultancy; Third Coast Therapeutics: Membership on an entity's Board of Directors or advisory committees; Oregon Health & Science University: Patents & Royalties; Patient True Talk: Consultancy; Millipore: Patents & Royalties; Monojul: Consultancy; Gilead Sciences: Consultancy, Membership on an entity's Board of Directors or advisory committees; Amgen: Membership on an entity's Board of Directors or advisory committees; Leukemia & Lymphoma Society: Membership on an entity's Board of Directors or advisory committees, Research Funding; GRAIL: Consultancy, Membership on an entity's Board of Directors or advisory committees; Beta Cat: Membership on an entity's Board of Directors or advisory committees; MolecularMD: Consultancy, Equity Ownership, Membership on an entity's Board of Directors or advisory committees; Henry Stewart Talks: Patents & Royalties; Bristol-Meyers Squibb: Research Funding; Blueprint Medicines: Consultancy, Equity Ownership, Membership on an entity's Board of Directors or advisory committees; Aptose Therapeutics: Consultancy, Equity Ownership, Membership on an entity's Board of Directors or advisory committees; McGraw Hill: Patents & Royalties; 
ARIAD: Research Funding; Novartis Pharmaceuticals: Research Funding. Heckman:Orion Pharma: Research Funding; Novartis: Research Funding; Celgene: Research Funding. Porkka:Novartis: Honoraria, Research Funding; Celgene: Honoraria, Research Funding. Tyner:AstraZeneca: Research Funding; Incyte: Research Funding; Janssen: Research Funding; Leap Oncology: Equity Ownership; Seattle Genetics: Research Funding; Syros: Research Funding; Takeda: Research Funding; Gilead: Research Funding; Genentech: Research Funding; Aptose: Research Funding; Agios: Research Funding. Aittokallio:Novartis: Research Funding. Wennerberg:Novartis: Research Funding.
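The Average ex vivo Drug Sensitivity (ADS) defined in the Results of this abstract, together with the per-drug correlation against it, can be sketched as below. The abstract defines ADS as the average response across the common drugs but does not say whether the drug being compared is left out of its own average, nor how missing AUCs are handled; this sketch includes every drug and assumes a complete patients-by-drugs matrix. The function names and the toy AUC values are hypothetical.

```python
import numpy as np

def average_drug_sensitivity(auc):
    """Per-patient mean AUC across drugs; `auc` is shaped (patients, drugs)."""
    return np.mean(np.asarray(auc, dtype=float), axis=1)

def correlation_with_ads(auc):
    """Pearson correlation of each drug's AUC profile with the patients' ADS."""
    auc = np.asarray(auc, dtype=float)
    ads = average_drug_sensitivity(auc)
    return np.array([np.corrcoef(auc[:, d], ads)[0, 1] for d in range(auc.shape[1])])

# Toy screen: three patients, three drugs. The third drug runs against the trend.
toy_auc = np.array([
    [1.0, 10.0, 3.0],
    [2.0, 20.0, 2.0],
    [3.0, 30.0, 1.0],
])
per_drug_r = correlation_with_ads(toy_auc)
print(per_drug_r, np.median(per_drug_r))
```

In the toy screen the third drug is inversely related to ADS, loosely mirroring the behaviour the abstract reports for venetoclax; the numbers themselves carry no biological meaning.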

