Machine learning based prediction of gliomas with germline mutations obtained from whole exome sequences from TCGA and 1000 Genomes Project

Author(s):  
Abdulrhman Aljouie ◽  
Michael Schatz ◽  
Usman Roshan
2019 ◽  
Author(s):  
Li-Ju Wang ◽  
Catherine W. Zhang ◽  
Sophia C. Su ◽  
Hung-I H. Chen ◽  
Yu-Chiao Chiu ◽  
...  

Abstract Background Europeans and American Indians were the major genetic ancestries of Hispanics in the U.S. These ancestral groups have markedly different incidence rates and outcomes in many types of cancer. Genetic admixture may therefore bias genetic association studies of cancer susceptibility variants, specifically in Hispanics. The incidence rate and genetic mutational pattern of liver cancer have been shown to differ substantially between Hispanic, Asian and non-Hispanic white populations. Currently, ancestry informative marker (AIM) panels of up to a few hundred ancestry-informative single nucleotide polymorphisms (SNPs) are widely used to infer ancestry admixture. Notably, currently available AIMs are predominantly located in intronic and intergenic regions, while the whole exome sequencing (WES) protocols commonly used in translational research and clinical practice do not cover these markers. It therefore remains challenging to accurately determine a patient’s admixture proportion without additional DNA testing. Methods Here we designed a bioinformatics pipeline to obtain an AIM panel. The panel infers 3-way genetic admixture from three distinct continental populations (African (AFR), European (EUR), and East Asian (EAS)) and is constrained to evolutionarily conserved exonic regions. Briefly, we extracted ∼1 million exonic SNPs from all individuals of the three populations in the 1000 Genomes Project. The SNPs were then trimmed by their linkage disequilibrium (LD), restricted to biallelic variants only, and assembled into an AIM panel of the markers with the top ancestral informativeness based on the In-statistic. The selected AIM panel was applied to a training dataset and a clinical dataset. 
Finally, the ancestral proportions of each individual were estimated by STRUCTURE. Results In this study, the optimally selected AIM panel with 250 markers, the UT-AIM250 panel, achieved better accuracy than one of the published AIM panels when tested with 3 ancestral populations (accuracy: 0.995 ± 0.012 for AFR, 0.997 ± 0.007 for EUR, and 0.994 ± 0.012 for EAS). We demonstrated the utility of the UT-AIM250 panel on the admixed American (AMR) samples of the 1000 Genomes Project and obtained similar results (AFR: 0.085 ± 0.098; EUR: 0.665 ± 0.182; and EAS: 0.250 ± 0.205) to previously published AIM panels (Phillips-AIM34: AFR: 0.096 ± 0.127, EUR: 0.575 ± 0.29, and EAS: 0.330 ± 0.315; Wei-AIM278: AFR: 0.070 ± 0.096, EUR: 0.537 ± 0.267, and EAS: 0.393 ± 0.300) with no significant difference (Pearson correlation, P < 10⁻⁵⁰, n = 347 samples). Subsequently, we applied the UT-AIM250 panel to clinical datasets of self-reported Hispanic patients in South Texas with hepatocellular carcinoma (26 patients). Our estimated admixture proportions from adjacent non-cancer liver tissue data of Hispanics in South Texas were AFR: 0.065 ± 0.043; EUR: 0.594 ± 0.150; and EAS: 0.341 ± 0.160, with smaller variation attributed to the unique Texan/Mexican American population in South Texas. We also obtained similar admixture proportions from the corresponding tumor tissue. In addition, we estimated admixture proportions of the entire set of TCGA-LIHC samples (376 patients) using the UT-AIM250 panel. 
We demonstrated that our AIM panel estimates consistent admixture proportions from DNA derived from tumor and normal tissues, identifies 2 cases of possibly incorrectly reported race/ethnicity, and can provide race/ethnicity determination if necessary. Conclusions Taken together, we demonstrated the feasibility of using evolutionarily conserved exonic regions to distinguish genetic ancestry based on 3 continental-ancestry proportions, and provided a robust and reliable control for sample collection or patient stratification in genetic analysis. An R implementation of UT-AIM250 is available at https://github.com/chenlabgccri/UT-AIM250.
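
The marker-ranking step described in the abstract above (keeping the SNPs with the highest In-statistic) can be illustrated with a minimal Python sketch. It assumes the In-statistic is Rosenberg's informativeness for assignment computed from per-population allele frequencies; the function names, marker IDs, and frequencies are hypothetical and are not taken from the UT-AIM250 code, which is implemented in R.

```python
import math

def informativeness_in(freqs_by_pop):
    """Rosenberg's informativeness for assignment (I_n) for one marker.

    freqs_by_pop holds one list per population; each inner list gives the
    allele frequencies of the marker in that population (summing to 1).
    """
    k = len(freqs_by_pop)
    n_alleles = len(freqs_by_pop[0])
    total = 0.0
    for j in range(n_alleles):
        # Mean frequency of allele j across the K populations.
        p_mean = sum(pop[j] for pop in freqs_by_pop) / k
        if p_mean > 0:
            total -= p_mean * math.log(p_mean)
        for pop in freqs_by_pop:
            if pop[j] > 0:
                total += (pop[j] / k) * math.log(pop[j])
    return total

def top_markers(panel, n):
    """Return the names of the n markers with the highest I_n."""
    ranked = sorted(panel, key=lambda name: informativeness_in(panel[name]),
                    reverse=True)
    return ranked[:n]

# Hypothetical biallelic markers; frequencies listed as (AFR, EUR, EAS).
example_panel = {
    "snp_flat": [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
    "snp_mild": [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]],
    "snp_sharp": [[0.95, 0.05], [0.10, 0.90], [0.05, 0.95]],
}
best = top_markers(example_panel, 2)
```

A marker with identical frequencies in every population scores zero, so it carries no ancestry information and ranks last. The LD trimming and the STRUCTURE estimation steps are not covered by this sketch.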


BMC Genomics ◽  
2019 ◽  
Vol 20 (S12) ◽  
Author(s):  
Li-Ju Wang ◽  
Catherine W. Zhang ◽  
Sophia C. Su ◽  
Hung-I H. Chen ◽  
Yu-Chiao Chiu ◽  
...  

Abstract Background Europeans and American Indians were the major genetic ancestries of Hispanics in the U.S. These ancestral groups have markedly different incidence rates and outcomes in many types of cancers. Therefore, genetic admixture may bias genetic association studies of cancer susceptibility variants, specifically in Hispanics. For example, the incidence rate of liver cancer has been shown to differ substantially between Hispanic, Asian and non-Hispanic white populations. Currently, ancestry informative marker (AIM) panels of up to a few hundred ancestry-informative single nucleotide polymorphisms (SNPs) are widely used to infer ancestry admixture. Notably, currently available AIMs are predominantly located in intronic and intergenic regions, while the whole exome sequencing (WES) protocols commonly used in translational research and clinical practice do not cover these markers. Thus, it remains challenging to accurately determine a patient’s admixture proportion without additional DNA testing. Results In this study we designed a unique AIM panel that infers 3-way genetic admixture from three distinct, selected continental populations (African (AFR), European (EUR), and East Asian (EAS)) within evolutionarily conserved exonic regions. Initially, about 1 million exonic SNPs from the three selected populations in the 1000 Genomes Project were trimmed by their linkage disequilibrium (LD) and restricted to biallelic variants; we then optimized these to an AIM panel of 250 SNP markers, the UT-AIM250 panel, using their ancestral informativeness statistics. Compared with published AIM panels, UT-AIM250 achieved better accuracy when tested with the three ancestral populations (accuracy: 0.995 ± 0.012 for AFR, 0.997 ± 0.007 for EUR, and 0.994 ± 0.012 for EAS). 
We further demonstrated the performance of the UT-AIM250 panel on admixed American (AMR) samples of the 1000 Genomes Project and obtained similar results (AFR, 0.085 ± 0.098; EUR, 0.665 ± 0.182; and EAS, 0.250 ± 0.205) to previously published AIM panels (Phillips-AIM34: AFR, 0.096 ± 0.127, EUR, 0.575 ± 0.290, and EAS, 0.330 ± 0.315; Wei-AIM278: AFR, 0.070 ± 0.096, EUR, 0.537 ± 0.267, and EAS, 0.393 ± 0.300). Subsequently, we applied the UT-AIM250 panel to a clinical dataset of 26 self-reported Hispanic patients in South Texas with hepatocellular carcinoma (HCC). We estimated the admixture proportions using WES data of adjacent non-cancer liver tissues (AFR, 0.065 ± 0.043; EUR, 0.594 ± 0.150; and EAS, 0.341 ± 0.160). Similar admixture proportions were identified from corresponding tumor tissues. In addition, we estimated admixture proportions of The Cancer Genome Atlas (TCGA) collection of hepatocellular carcinoma (TCGA-LIHC) samples (376 patients) using the UT-AIM250 panel. The panel obtained consistent admixture proportions from tumor and matched normal tissues, identified 3 cases of possibly incorrectly reported race/ethnicity, and provided race/ethnicity determination where necessary. Conclusions Here we demonstrated the feasibility of using evolutionarily conserved exonic regions to infer admixture proportions and provided a robust and reliable control for sample collection or patient stratification for genetic analysis. An R implementation of UT-AIM250 is available at https://github.com/chenlabgccri/UT-AIM250.


Blood ◽  
2016 ◽  
Vol 128 (22) ◽  
pp. 3352-3352
Author(s):  
Jefferson L. Lansford ◽  
Shengjie Chai ◽  
Gheath Alatrash ◽  
Jeffrey J. Molldrem ◽  
Paul M. Armistead ◽  
...  

Abstract Background and Rationale: T-cell responses to minor histocompatibility antigens (mHA) are important drivers of the beneficial graft versus leukemia (GvL) effect and harmful graft versus host disease (GvHD) pathology following HLA-matched allogeneic stem cell transplantation (alloSCT). Despite their importance, Genome Wide Association Studies (GWAS) have failed to elicit a prognostic set of individual mHA associated with the clinically observed GvL and GvH effects of alloSCT [Sato-Otsubo et al. Blood 2015]. This is likely due to a lack of public mHA shared across the global set of donor/recipient pairs (DRPs). Even an optimally frequent single nucleotide polymorphism (SNP) would result in a recipient restricted genetic variant in only 24% of DRPs with a matched unrelated donor (MUD) or 17% of DRPs with a matched related donor (MRD) [Armistead et al. PLoS One 2011]. Moreover, even when DRPs do share a recipient restricted genetic variant, any resulting peptide would have to bind a MHC molecule within the recipient for presentation to T-cells. Methods: To evaluate the role of mHA in alloSCT without requiring public mHA, we developed a bioinformatics pipeline to predict mHA peptides based on SNP differences taking into account peptide/MHC binding estimated by netMHCpan [Nielson et al. Genome Med 2016] and expected GvL vs GvH tissue expression derived from mRNA sequencing data from acute myeloid leukemia (AML) and normal hematopoietic, hepatobiliary, skin, and gastrointestinal tract tissues. In order to understand the distribution of mHA in HLA-matched alloSCT drawn from a diverse pool of DRPs, we applied this analysis to putative DRPs drawn from healthy individuals who underwent whole-exome sequencing as part of the 1000 Genomes Project (n = 1,916) [1000 Genomes Project Consortium et al. Nature 2015]. 
To evaluate the association of mHA with AML relapse and GvHD incidence following alloSCT, we predicted mHA from SNP data using clinical information from an earlier study [Armistead et al. PLoS One 2011]. Results: To determine a baseline for the number of predicted mHA in an alloSCT, we considered every possible pair of samples in the 1000 Genomes data set as a theoretical alloSCT (n = 3,669,140). We determined each sample's HLA type using PHLAT [Bai et al. BMC Genomics 2014] and then performed mHA predictions for all theoretical transplants with a 10 out of 10 HLA match (n = 10). Within each ethnicity represented in the 1000 Genomes data, the degree of HLA matching was greater than that of all ethnicities pooled, with the lowest HLA diversity in the Finnish, Chinese, and Japanese populations. The number of predicted mHA binding MHC with Kd < 500nM ranged from 6,217 to 13,545 in the 1000 Genomes theoretical HLA-matched DRPs, with mHA contained in genes selectively expressed in AML versus those selectively expressed in GvH target organs numbering 213 to 610 and 367 to 1,135, respectively. HLA-A*02:01 restricted mHA were predicted for 37 actual DRPs in the context of MUD alloSCT and 97 DRPs from MRD alloSCT [Armistead et al. PLoS One 2011]. There were significantly more predicted mHA in MUD transplants (Figure 1). Taking into account both predicted peptide/MHC binding and tissue expression, the number of predicted GvL mHA was significantly associated with remission and the aggregate number of predicted GvH mHA was significantly associated with grade 2-4 GvHD incidence in MUD transplants (Figure 2). Conclusions: Prediction of mHA based on whole-exome sequencing data is feasible and can be used to discover associations of mHA distribution features with clinical outcomes including AML remission and GvHD incidence. Future work in larger datasets will be required to validate these predicted associations and guide development of mHA-directed therapeutics. 
Disclosures Molldrem: Astellas Pharma: Patents & Royalties.
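
The counting and filtering logic in the abstract above can be sketched in a few lines of Python. The reported number of theoretical transplants (3,669,140) equals 1,916 × 1,915, which is consistent with treating each donor/recipient direction as a separate pair; that reading is an inference from the numbers, not something the abstract states. The function names, variant labels, peptide labels, and Kd values below are hypothetical, and the real pipeline obtains binding estimates from netMHCpan rather than from a dictionary.

```python
from itertools import permutations

def ordered_pair_count(n_samples):
    """Number of donor/recipient pairs when direction matters."""
    return n_samples * (n_samples - 1)

def recipient_restricted(donor_variants, recipient_variants):
    """Variants carried by the recipient but absent from the donor."""
    return recipient_variants - donor_variants

def binders(peptide_kd_nm, threshold_nm=500.0):
    """Keep peptides whose predicted Kd is below the threshold (in nM)."""
    return sorted(p for p, kd in peptide_kd_nm.items() if kd < threshold_nm)

# Hypothetical variant calls for three samples.
samples = {
    "S1": {"var_a", "var_b"},
    "S2": {"var_b", "var_c"},
    "S3": {"var_a", "var_c", "var_d"},
}

# Every ordered (donor, recipient) pair is one theoretical transplant.
restricted = {
    (donor, recipient): recipient_restricted(samples[donor], samples[recipient])
    for donor, recipient in permutations(samples, 2)
}

# Hypothetical predicted affinities for peptides spanning restricted variants.
strong = binders({"pep_1": 42.0, "pep_2": 510.0, "pep_3": 499.9})
```

The set difference is directional, so swapping donor and recipient generally yields a different set of candidate variants, which is why ordered pairs are the natural unit here. HLA matching and tissue-expression filtering are left out of the sketch.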


2021 ◽  
Vol 11 (3) ◽  
pp. 231
Author(s):  
Faven Butler ◽  
Ali Alghubayshi ◽  
Youssef Roman

Gout is an inflammatory condition caused by elevated serum urate (SU), a condition known as hyperuricemia (HU). Genetic variations, including single nucleotide polymorphisms (SNPs), can alter the function of urate transporters, leading to differential HU and gout prevalence across different populations. In the United States (U.S.), gout prevalence differentially affects certain racial groups. The objective of this proposed analysis is to compare the frequency of urate-related genetic risk alleles between Europeans (EUR) and the following major racial groups: Africans in Southwest U.S. (ASW), Han-Chinese (CHS), Japanese (JPT), and Mexican (MXL) from the 1000 Genomes Project. The Ensembl genome browser of the 1000 Genomes Project was used to conduct cross-population allele frequency comparisons of 11 SNPs across 11 genes that are physiologically involved in, and significantly associated with, SU levels and gout risk. Gene/SNP pairs included: ABCG2 (rs2231142), SLC2A9 (rs734553), SLC17A1 (rs1183201), SLC16A9 (rs1171614), GCKR (rs1260326), SLC22A11 (rs2078267), SLC22A12 (rs505802), INHBC (rs3741414), RREB1 (rs675209), PDZK1 (rs12129861), and NRXN2 (rs478607). Allele frequencies were compared to EUR using the Chi-Square or Fisher’s Exact test, as appropriate. Bonferroni correction for multiple comparisons was used, with p < 0.0045 for statistical significance. A risk allele was defined as the allele associated with baseline or higher HU and gout risks. The cumulative HU or gout risk allele index of the 11 SNPs was estimated for each population. The prevalence of HU and gout in U.S. and non-US populations was evaluated using published epidemiological data and literature review. Compared with EUR, the SNP frequencies of 7/11 in ASW, 9/11 in MXL, 9/11 in JPT, and 11/11 in CHS were significantly different. HU or gout risk allele indices were 5, 6, 9, and 11 in ASW, MXL, CHS, and JPT, respectively. Out of the 11 SNPs, the percentage of risk alleles in CHS and JPT was 100%. 
Compared to non-US populations, the prevalence of HU and gout appears to be higher in Western countries. Compared with EUR, CHS and JPT populations had the highest HU or gout risk allele frequencies, followed by MXL and ASW. These results suggest that individuals of Asian descent are at higher HU and gout risk, which may partly explain the nearly three-fold higher gout prevalence among Asians versus Caucasians in ambulatory care settings. Furthermore, gout remains a disease of developed countries, with a marked global rise.
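
The significance threshold quoted in the abstract above is consistent with a Bonferroni correction over the 11 SNPs at a family-wise alpha of 0.05 (0.05 / 11 ≈ 0.0045); the alpha value is an assumption, since the abstract reports only the corrected threshold. The Python sketch below shows that arithmetic together with a plain Pearson chi-square test on a 2×2 table of allele counts. The counts are invented for illustration, and the sketch omits both the continuity correction and the Fisher's exact fallback mentioned in the abstract.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom, no continuity
    correction) for the table [[a, b], [c, d]]; returns (statistic, p)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

threshold = bonferroni_threshold(0.05, 11)  # assumed alpha = 0.05, 11 SNPs

# Hypothetical allele counts: rows are populations (reference vs. comparison),
# columns are risk-allele and other-allele counts.
stat, p_value = chi_square_2x2(30, 70, 60, 40)
is_significant = p_value < threshold
```

With small expected cell counts the chi-square approximation becomes unreliable, which is the usual reason for switching to Fisher's exact test.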


Author(s):  
Magdalena Kukla-Bartoszek ◽  
Paweł Teisseyre ◽  
Ewelina Pośpiech ◽  
Joanna Karłowska-Pik ◽  
Piotr Zieliński ◽  
...  

Abstract Increasing understanding of human genome variability allows for better use of the predictive potential of DNA. An obvious direct application is the prediction of the physical phenotypes. Significant success has been achieved, especially in predicting pigmentation characteristics, but the inference of some phenotypes is still challenging. In search of further improvements in predicting human eye colour, we conducted whole-exome (enriched in regulome) sequencing of 150 Polish samples to discover new markers. For this, we adopted quantitative characterization of eye colour phenotypes using high-resolution photographic images of the iris in combination with DIAT software analysis. An independent set of 849 samples was used for subsequent predictive modelling. Newly identified candidates and 114 additional literature-based selected SNPs, previously associated with pigmentation, and advanced machine learning algorithms were used. Whole-exome sequencing analysis found 27 previously unreported candidate SNP markers for eye colour. The highest overall prediction accuracies were achieved with LASSO-regularized and BIC-based selected regression models. A new candidate variant, rs2253104, located in the ARFIP2 gene and identified with the HyperLasso method, revealed predictive potential and was included in the best-performing regression models. Advanced machine learning approaches showed a significant increase in sensitivity of intermediate eye colour prediction (up to 39%) compared to 0% obtained for the original IrisPlex model. We identified a new potential predictor of eye colour and evaluated several widely used advanced machine learning algorithms in predictive analysis of this trait. Our results provide useful hints for developing future predictive models for eye colour in forensic and anthropological studies.


2014 ◽  
Vol 6 (4) ◽  
pp. 846-860 ◽  
Author(s):  
Gabriel Santpere ◽  
Fleur Darre ◽  
Soledad Blanco ◽  
Antonio Alcami ◽  
Pablo Villoslada ◽  
...  

2015 ◽  
Vol 32 (9) ◽  
pp. 1366-1372 ◽  
Author(s):  
Dmitry Prokopenko ◽  
Julian Hecker ◽  
Edwin K. Silverman ◽  
Marcello Pagano ◽  
Markus M. Nöthen ◽  
...  

Author(s):  
Juan Chen ◽  
Yan Li ◽  
Jianlei Wu ◽  
Yakun Liu ◽  
Shan Kang

Abstract Background Malignant ovarian germ cell tumors (MOGCTs) are rare and heterogeneous ovarian tumors. We aimed to identify potential germline mutations and somatic mutations in MOGCTs by whole-exome sequencing. Methods The peripheral blood and tumor samples from these patients were used to identify germline mutations and somatic mutations, respectively. For genes corresponding to copy number alteration (CNA) deletion and duplication regions, functional annotation was performed. Immunohistochemistry was performed to evaluate the expression of mutated genes corresponding to the CNA deletion region. Results In peripheral blood, copy number loss and gain were mostly found in yolk sac tumors (YST). Moreover, POU5F1 was the most significantly mutated gene, with mutation frequency > 10% in both the CNA deletion and duplication regions. In addition, strong cytoplasmic staining of POU5F1 (corresponding to the CNA deletion region) was found in 2 YST and nuclear staining in 2 dysgerminoma (DG) tumor samples. Genes corresponding to the CNA deletion region were significantly enriched in the signaling pathway regulating pluripotency of stem cells. In addition, genes corresponding to the CNA duplication region were significantly enriched in the RIG-I-like receptor, Toll-like receptor, NF-kappa B and Jak–STAT signaling pathways. KRT4, RPL14, PCSK6, PABPC3 and SARM1 mutations were detected in both peripheral blood and tumor samples. Conclusions Identification of potential germline mutations and somatic mutations in MOGCTs may provide new insight into the genetic features of this rare ovarian tumor type.


PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0254363
Author(s):  
Aji John ◽  
Kathleen Muenzen ◽  
Kristiina Ausmees

Advances in whole-genome sequencing have greatly reduced the cost and time of obtaining raw genetic information, but the computational requirements of analysis remain a challenge. Serverless computing has emerged as an alternative to using dedicated compute resources, but its utility has not been widely evaluated for standardized genomic workflows. In this study, we define and execute a best-practice joint variant calling workflow using the SWEEP workflow management system. We present an analysis of performance and scalability, and discuss the utility of the serverless paradigm for executing workflows in the field of genomics research. The GATK best-practice short germline joint variant calling pipeline was implemented as a SWEEP workflow comprising 18 tasks. The workflow was executed on Illumina paired-end read samples from the European and African super populations of the 1000 Genomes project phase III. Cost and runtime increased linearly with increasing sample size, although runtime was driven primarily by a single task for larger problem sizes. Execution took a minimum of around 3 hours for 2 samples, up to nearly 13 hours for 62 samples, with costs ranging from $2 to $70.
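
The linear scaling reported in the abstract above can be turned into a rough interpolation between the two stated endpoints (about 3 hours and $2 for 2 samples, about 13 hours and $70 for 62 samples). The Python sketch below is only that: a straight line through two approximate figures, with hypothetical function names. It is not a model fitted to the study's measurements, and extrapolating beyond 62 samples is not supported by the abstract.

```python
def linear_through(point_a, point_b):
    """Return f(x) for the straight line through two (x, y) points."""
    (x0, y0), (x1, y1) = point_a, point_b
    slope = (y1 - y0) / (x1 - x0)
    return lambda x: y0 + slope * (x - x0)

# Endpoints taken from the abstract; both are approximate.
runtime_hours = linear_through((2, 3.0), (62, 13.0))
cost_usd = linear_through((2, 2.0), (62, 70.0))

# Interpolated estimates for a hypothetical 32-sample run.
estimated_hours = runtime_hours(32)
estimated_cost = cost_usd(32)
```

Under this reading, each additional sample adds roughly ten minutes of runtime and a little over one dollar of cost.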

