Quantitative disease risk scores from EHR with applications to clinical risk stratification and genetic studies

2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Danqing Xu ◽  
Chen Wang ◽  
Atlas Khan ◽  
Ning Shang ◽  
Zihuai He ◽  
...  

Abstract Labeling clinical data from electronic health records (EHR) in health systems requires extensive expert knowledge and painstaking review by clinicians. Furthermore, existing phenotyping algorithms are not uniformly applied across large datasets and can suffer from inconsistencies in case definitions across different algorithms. We describe here quantitative disease risk scores based on almost unsupervised methods that require minimal input from clinicians, can be applied to large datasets, and alleviate some of the main weaknesses of existing phenotyping algorithms. We show applications to phenotypic data on approximately 100,000 individuals in eMERGE, and focus on several complex diseases, including Chronic Kidney Disease, Coronary Artery Disease, Type 2 Diabetes, Heart Failure, and a few others. We demonstrate that relative to existing approaches, the proposed methods have higher prediction accuracy, can better identify phenotypic features relevant to the disease under consideration, can perform better at clinical risk stratification, and can identify undiagnosed cases based on phenotypic features available in the EHR. Using genetic data from the eMERGE-seq panel that includes sequencing data for 109 genes on 21,363 individuals from multiple ethnicities, we also show how the new quantitative disease risk scores help improve the power of genetic association studies relative to the standard use of disease phenotypes. The results demonstrate the effectiveness of quantitative disease risk scores derived from rich phenotypic EHR databases to provide a more meaningful characterization of clinical risk for diseases of interest beyond the prevalent binary (case-control) classification.

2017 ◽  
Vol 63 (1) ◽  
pp. 186-195 ◽  
Author(s):  
Benjamin M Scirica

Abstract BACKGROUND: As the worldwide prevalence of type 2 diabetes mellitus (T2DM) increases, it is even more important to develop cost-effective methods to predict and diagnose the onset of diabetes, monitor progression, and risk stratify patients in terms of subsequent cardiovascular and diabetes complications. CONTENT: Nonlaboratory clinical risk scores based on risk factors and anthropometric data can help identify patients at greatest risk of developing diabetes, but glycemic indices (hemoglobin A1c, fasting plasma glucose, and oral glucose tolerance tests) are the cornerstones for diagnosis, and the basis for monitoring therapy. Although family history is a strong predictor of T2DM, only small populations of patients carry clearly identifiable genetic mutations. Better modalities for detection of insulin resistance would improve earlier identification of dysglycemia and guide effective therapy based on therapeutic mechanisms of action, but improved standardization of insulin assays will be required. Although clinical risk models can stratify patients for subsequent cardiovascular risk, the addition of cardiac biomarkers, in particular high-sensitivity troponin and natriuretic peptide, significantly improves model performance and risk stratification. CONCLUSIONS: Much more research, prospectively planned and with clear treatment implications, is needed to define novel biomarkers that better identify the underlying pathogenic etiologies of dysglycemia. When compared with traditional risk features, biomarkers provide greater discrimination of future risk, and the integration of cardiac biomarkers should be considered part of standard risk stratification in patients with T2DM.


2016 ◽  
Author(s):  
Alicia R. Martin ◽  
Christopher R. Gignoux ◽  
Raymond K. Walters ◽  
Genevieve L. Wojcik ◽  
Benjamin M. Neale ◽  
...  

Abstract The vast majority of genome-wide association studies are performed in Europeans, and their transferability to other populations is dependent on many factors (e.g. linkage disequilibrium, allele frequencies, genetic architecture). As medical genomics studies become increasingly large and diverse, gaining insights into population history and consequently the transferability of disease risk measurement is critical. Here, we disentangle recent population history in the widely-used 1000 Genomes Project reference panel, with an emphasis on populations underrepresented in medical studies. To examine the transferability of single-ancestry GWAS, we used published summary statistics to calculate polygenic risk scores for six well-studied traits and diseases. We identified directional inconsistencies in all scores; for example, height is predicted to decrease with genetic distance from Europeans, despite robust anthropological evidence that West Africans are as tall as Europeans on average. To gain deeper quantitative insights into GWAS transferability, we developed a complex trait coalescent-based simulation framework considering effects of polygenicity, causal allele frequency divergence, and heritability. As expected, correlations between true and inferred risk were typically highest in the population from which summary statistics were derived. We demonstrated that scores inferred from European GWAS were biased by genetic drift in other populations even when choosing the same causal variants, and that biases in any direction were possible and unpredictable. This work cautions that summarizing findings from large-scale GWAS may have limited portability to other populations using standard approaches, and highlights the need for generalized risk prediction methods and the inclusion of more diverse individuals in medical genomics.


2020 ◽  
Author(s):  
William P.T.M. van Doorn ◽  
Patricia M. Stassen ◽  
Hella F. Borggreve ◽  
Maaike J. Schalkwijk ◽  
Judith Stoffers ◽  
...  

Abstract Introduction: Patients with sepsis who present to an emergency department (ED) have highly variable underlying disease severity, and can be categorized from low to high risk. Development of a risk stratification tool for these patients is important for appropriate triage and early treatment. The aim of this study was to develop machine learning models predicting 31-day mortality in patients presenting to the ED with sepsis and to compare these to internal medicine physicians and clinical risk scores. Methods: A single-center, retrospective cohort study was conducted amongst 1,344 emergency department patients fulfilling sepsis criteria. Laboratory and clinical data that were available in the first two hours of presentation from these patients were randomly partitioned into a development (n=1,244) and validation dataset (n=100). Machine learning models were trained and evaluated on the development dataset and compared to internal medicine physicians and risk scores in the independent validation dataset. The primary outcome was 31-day mortality. Results: A total of 1,344 patients were included, of whom 174 (13.0%) died. Machine learning models trained with laboratory or a combination of laboratory + clinical data achieved an area under the ROC curve of 0.82 (95% CI: 0.80-0.84) and 0.84 (95% CI: 0.81-0.87) for predicting 31-day mortality, respectively. In the validation set, models outperformed internal medicine physicians and clinical risk scores in sensitivity (92% vs. 72% vs. 78%; p<0.001, all comparisons) while retaining comparable specificity (78% vs. 74% vs. 72%; p>0.02). The model had higher diagnostic accuracy with an area under the ROC curve of 0.85 (95% CI: 0.78-0.92) compared to abbMEDS (0.63, 0.54-0.73), mREMS (0.63, 0.54-0.72) and internal medicine physicians (0.74, 0.65-0.82). Conclusion: Machine learning models outperformed internal medicine physicians and clinical risk scores in predicting 31-day mortality. These models are a promising tool to aid in risk stratification of patients presenting to the ED with sepsis.
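
The sensitivity and specificity figures quoted in this abstract are the standard confusion-matrix ratios. A minimal sketch of how they are computed is given here for orientation; the function name and the toy labels are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Assumed coding: 1 = died within 31 days, 0 = survived.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # deaths flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # deaths missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # survivors cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # survivors flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels for illustration only, not data from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

A higher sensitivity at comparable specificity, as reported above, means fewer deaths are missed without flagging many more survivors.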


2021 ◽  
Author(s):  
Ying Wang ◽  
Shinichi Namba ◽  
Esteban Lopera ◽  
Sini Kerminen ◽  
Kristin Tsuo ◽  
...  

Summary With the increasing availability of biobank-scale datasets that incorporate both genomic data and electronic health records, many associations between genetic variants and phenotypes of interest have been discovered. Polygenic risk scores (PRS), which are being widely explored in precision medicine, use the results of association studies to predict the genetic component of disease risk by accumulating risk alleles weighted by their effect sizes. However, limited studies have thoroughly investigated best practices for PRS in global populations across different diseases. In this study, we utilize data from the Global-Biobank Meta-analysis Initiative (GBMI), which consists of individuals from diverse ancestries and across continents, to explore methodological considerations and PRS prediction performance in 9 different biobanks for 14 disease endpoints. Specifically, we constructed PRS using heuristic (pruning and thresholding, P+T) and Bayesian (PRS-CS) methods. We found that the genetic architecture, such as SNP-based heritability and polygenicity, varied greatly among endpoints. For both PRS construction methods, using a European ancestry LD reference panel resulted in comparable or higher prediction accuracy compared to several other non-European based panels; this is largely attributable to European descent populations still comprising the majority of GBMI participants. PRS-CS overall outperformed the classic P+T method, especially for endpoints with higher SNP-based heritability. For example, substantial improvements are observed in East-Asian ancestry (EAS) using PRS-CS compared to P+T for heart failure (HF) and chronic obstructive pulmonary disease (COPD). Notably, prediction accuracy is heterogeneous across endpoints, biobanks, and ancestries, especially for asthma which has known variation in disease prevalence across global populations. Overall, we provide lessons for PRS construction, evaluation, and interpretation using the GBMI and highlight the importance of best practices for PRS in the biobank-scale genomics era.
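
The description of a PRS as risk alleles accumulated and weighted by their effect sizes amounts to a weighted sum per individual. The sketch below illustrates only that final summation step, under the assumption that variant selection and weight estimation (for example by P+T or PRS-CS) have already been done; the function name, dosages, and effect sizes are made up for illustration.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per variant).

    effect_sizes are assumed to be per-allele weights, e.g. log odds
    ratios from an independent GWAS.
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("one effect size is required per variant")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Hypothetical weights for three variants; not values from any study.
effect_sizes = [0.10, -0.05, 0.20]
score_a = polygenic_risk_score([2, 0, 1], effect_sizes)  # 2*0.10 + 1*0.20
score_b = polygenic_risk_score([0, 2, 0], effect_sizes)  # 2*(-0.05)
```

Because the raw score depends on allele frequencies and LD in the reference population, scores are usually compared within a cohort rather than interpreted on an absolute scale.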


2021 ◽  
Author(s):  
Yixuan He ◽  
Chirag M Lakhani ◽  
Danielle Rasooly ◽  
Arjun K Manrai ◽  
Ioanna Tzoulaki ◽  
...  

OBJECTIVE: Establish a polyexposure score for T2D incorporating 12 non-genetic exposures and examine whether a polyexposure and/or a polygenic risk score improves diabetes prediction beyond traditional clinical risk factors. RESEARCH DESIGN AND METHODS: We identified 356,621 unrelated individuals from the UK Biobank of white British ancestry with no prior diagnosis of T2D and normal HbA1c levels. Using self-reported and hospital admission information, we deployed a machine learning procedure to select the most predictive and robust factors out of 111 non-genetically ascertained exposure and lifestyle variables for the polyexposure risk score (PXS) in prospective T2D. We computed the clinical risk score (CRS) and polygenic risk score (PGS) by taking a weighted sum of eight established clinical risk factors and over six million SNPs, respectively. RESULTS: In the study population, 7,513 had incident T2D. The C-statistics for the PGS, PXS, and CRS models were 0.709, 0.762, and 0.839, respectively. Hazard ratios (HR) associated with risk score values in the top 10% versus the remaining population were 2.00, 5.90, and 9.97 for PGS, PXS, and CRS, respectively. Addition of PGS and PXS to CRS improves T2D classification accuracy with a continuous net reclassification index of 15.2% and 30.1% for cases, respectively, and 7.3% and 16.9% for controls, respectively. CONCLUSIONS: For T2D, the PXS provides modest incremental predictive value over established clinical risk factors. The concept of PXS merits further consideration in T2D risk stratification and is likely to be useful in other chronic disease risk prediction models.
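
The C-statistic reported for each score can be read as the probability that a randomly chosen case has a higher score than a randomly chosen non-case. The abstract does not say how it was computed, and for time-to-event data a censoring-aware version such as Harrell's C is typical; the sketch below is a simplified binary-outcome version that ignores censoring, with hypothetical names and toy values.

```python
def c_statistic(scores, outcomes):
    """Pairwise concordance for a binary outcome (1 = case, 0 = non-case).

    Simplifying assumption: no censoring or follow-up time is considered.
    Ties between a case and a non-case count as half-concordant.
    """
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


# Toy risk scores, not UK Biobank data.
scores = [0.9, 0.4, 0.7, 0.2, 0.4]
outcomes = [1, 1, 0, 0, 0]
c = c_statistic(scores, outcomes)
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from non-cases.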


2015 ◽  
Vol 5 (1) ◽  
Author(s):  
Muhammad S. Ahmad ◽  
Zoheir A. Damanhouri ◽  
Torben Kimhofer ◽  
Hala H. Mosli ◽  
Elaine Holmes

Abstract Advanced glycation endproducts (AGEs) are believed to play a significant role in the pathophysiology of a variety of diseases including diabetes and cardiovascular diseases. Non-invasive skin autofluorescence (SAF) measurement serves as a proxy for tissue accumulation of AGEs. We assessed reference SAF and skin reflectance (SR) values in a Saudi population (n = 1,999) and evaluated the existing risk stratification scale. The mean SAF of the study cohort was 2.06 (SD = 0.57) arbitrary units (AU), which is considerably higher than the values reported for other populations. We show a previously unreported and significant difference in SAF values between men and women, with median (range) values of 1.77 AU (0.79–4.84 AU) and 2.20 AU (0.75–4.59 AU) respectively (p-value ≪ 0.01). Age, presence of diabetes and BMI were the most influential variables in determining SAF values in men, whilst in female participants, SR was also highly correlated with SAF. Diabetes, hypertension and obesity all showed strong association with SAF, particularly when gender differences were taken into account. We propose an adjusted, gender-specific disease risk stratification scheme for Middle Eastern populations. SAF is a potentially valuable clinical screening tool for cardiovascular risk assessment but risk scores should take gender and ethnicity into consideration for accurate diagnosis.


2020 ◽  
Vol 65 (No. 12) ◽  
pp. 445-453
Author(s):  
Anita Klímová ◽  
Eva Kašná ◽  
Karolína Machová ◽  
Michaela Brzáková ◽  
Josef Přibyl ◽  
...  

The inclusion of animal genotype data has contributed to the development of genomic selection. Animals are selected not only based on pedigree and phenotypic data but also on the basis of information about their genotypes. Genomic information helps to increase the accuracy of selection of young animals and thus enables a reduction of the generation interval. Obtaining information about genotypes in the form of SNPs (single nucleotide polymorphisms) has led to the development of new chips for genotyping. Several methods of genomic comparison have been developed as a result. One of the methods is data imputation, which allows missing SNPs to be inferred when moving from low-density to high-density chips. Through imputations, it is possible to combine information from diverse sets of chips and thus obtain more information about genotypes at a lower cost. Increasing the amount of data helps increase the reliability of predicting genomic breeding values. Imputation methods are increasingly used in genome-wide association studies. Combining classical genotyping with genome-wide sequencing data helps to increase the chances of identifying loci that are associated with economically significant traits.


PLoS Genetics ◽  
2021 ◽  
Vol 17 (9) ◽  
pp. e1009670
Author(s):  
Lars G. Fritsche ◽  
Ying Ma ◽  
Daiwei Zhang ◽  
Maxwell Salvatore ◽  
Seunggeun Lee ◽  
...  

Polygenic risk scores (PRS) can provide useful information for personalized risk stratification and disease risk assessment, especially when combined with non-genetic risk factors. However, their construction depends on the availability of summary statistics from genome-wide association studies (GWAS) independent from the target sample. For best compatibility, it was reported that GWAS and the target sample should match in terms of ancestries. Yet, GWAS, especially in the field of cancer, often lack diversity and are dominated by European ancestry. This bias is a limiting factor in PRS research. By using electronic health records and genetic data from the UK Biobank, we contrast the utility of breast and prostate cancer PRS derived from external European-ancestry-based GWAS across African, East Asian, European, and South Asian ancestry groups. We highlight differences in the PRS distributions of these groups that are amplified when PRS methods condense hundreds of thousands of variants into a single score. While European-GWAS-derived PRS were not directly transferable across ancestries on an absolute scale, we establish their predictive potential when considering them separately within each group. For example, the top 10% of the breast cancer PRS distributions within each ancestry group each revealed significant enrichments of breast cancer cases compared to the bottom 90% (odds ratio of 2.81 [95% CI: 2.69, 2.93] in European, 2.88 [1.85, 4.48] in African, 2.60 [1.25, 5.40] in East Asian, and 2.33 [1.55, 3.51] in South Asian individuals). Our findings highlight a compromise solution for PRS research to compensate for the lack of diversity in well-powered European GWAS efforts while recruitment of diverse participants in the field catches up.
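
The top-10% versus bottom-90% comparison reduces to an odds ratio on a 2x2 table of PRS group by case status. The sketch below uses the unadjusted odds ratio with a Wald (Woolf) interval on the log scale; this is an assumption made for illustration, since the abstract does not state the estimation method and published analyses are often covariate-adjusted. The counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with an approximate 95% CI (Woolf method).

    a = cases in the top 10%,    b = non-cases in the top 10%
    c = cases in the bottom 90%, d = non-cases in the bottom 90%
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Hypothetical counts, not UK Biobank data.
estimate, lower, upper = odds_ratio_ci(30, 70, 120, 780)
```

With small groups the interval widens sharply, which is consistent with the broader intervals quoted above for the smaller non-European ancestry groups.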


2017 ◽  
Author(s):  
Yanran Wang ◽  
Yuri Astrakhan ◽  
Britt-Sabina Petersen ◽  
Stefan Schreiber ◽  
Andre Franke ◽  
...  

Abstract Background: After many years of concentrated research efforts, the exact cause of Crohn’s disease remains unknown. Its accurate diagnosis, however, helps in managing and even preventing the onset of disease. Genome-wide association studies have identified 140 loci associated with CD, but these carry very small log odds ratios and are uninformative for diagnoses. Results: Here we describe a machine learning method – AVA,Dx (Analysis of Variation for Association with Disease) – that uses whole exome sequencing data to make predictions of CD status. Using the person-specific variation in these genes from a panel of only 111 individuals, we built disease-prediction models informative of previously undiscovered disease genes. In this panel, our models differentiate CD patients from healthy controls with 71% precision and 73% recall at the default cutoff. By additionally accounting for batch effects, we are also able to predict individual CD status for previously unseen individuals from a separate CD study (84% precision, 73% recall). Conclusions: Larger training panels and additional features, including regulatory variants and environmental factors, e.g. human-associated microbiota, are expected to improve model performance. However, current results already position AVA,Dx as both an effective method for highlighting pathogenesis pathways and as a simple Crohn’s disease risk analysis tool, which can improve clinical diagnostic time and accuracy.


Author(s):  
Jeremiah H. Li ◽  
Chase A. Mazur ◽  
Tomaz Berisa ◽  
Joseph K. Pickrell

Abstract Low-pass sequencing (sequencing a genome to an average depth less than 1x coverage) combined with genotype imputation has been proposed as an alternative to genotyping arrays for trait mapping and calculation of polygenic scores; however, the current literature is largely limited to simulation- and downsampling-based approaches. To empirically assess the relative performance of these technologies for different applications, we performed low-pass sequencing (targeting coverage levels of 0.5x and 1x) and array genotyping (using the Illumina Global Screening Array) on 120 DNA samples derived from African and European-ancestry individuals that are part of the 1000 Genomes Project. We then imputed both the sequencing data and the genotyping array data to the 1000 Genomes Phase 3 haplotype reference panel using a leave-one-out design. First, we evaluated overall imputation accuracy from these different assays as measured by genotype concordance; we introduce the concept of effective coverage that accounts for evenness of sequencing and show that this metric is a better predictor of imputation accuracy than nominal mapped coverage for low-pass sequencing data. Next, we evaluated overall power for genome-wide association studies (GWAS) as measured by the squared correlation between imputed and true genotypes. In the African individuals, at common variants (> 5% minor allele frequency), imputation r2 averaged 0.83 for the array data and ranged from 0.89 to 0.95 for the low-pass sequencing data, corresponding to an effective 7 – 15% increase in GWAS discovery power. For the same variants in the European individuals, imputation r2 averaged 0.91 for the array data and ranged from 0.92-0.96 for the low-pass sequencing data, corresponding to an effective 1-6% increase in GWAS discovery power. Finally, we computed polygenic risk scores for breast cancer and coronary artery disease from the different assays. We observed consistently lower measurement error for risk scores computed from low-pass sequencing data above an effective coverage of ∼ 0.5x. The mean squared error of the array-based estimates was three to four times that of the estimates from samples sequenced at an effective coverage of ∼ 1.2x for coronary artery disease, with qualitatively similar results for breast cancer. We conclude that low-pass sequencing plus imputation, in addition to providing a substantial increase in statistical power for genome wide association studies, provides increased accuracy for polygenic risk prediction at effective coverages of ∼ 0.5x and higher.
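
The imputation r2 used here is the squared Pearson correlation between imputed dosages and true genotypes. The sketch below computes it for one variant and then relates two r2 values through their ratio, on the assumption (a common rule of thumb, not stated in the abstract) that effective GWAS sample size scales with r2. Function names and the toy genotypes are hypothetical; the r2 inputs in the second part are the rounded averages quoted above.

```python
def squared_correlation(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true and imputed genotypes."""
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_t = sum(true_genotypes) / n
    mean_i = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (i - mean_i)
              for t, i in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_i = sum((i - mean_i) ** 2 for i in imputed_dosages)
    if var_t == 0 or var_i == 0:
        raise ValueError("r2 is undefined for a monomorphic variant")
    return cov * cov / (var_t * var_i)


def relative_gain(r2_new, r2_reference):
    """Fractional gain in effective sample size, assuming it scales with r2."""
    return r2_new / r2_reference - 1.0


# Toy genotypes for one variant; not 1000 Genomes data.
r2 = squared_correlation([0, 1, 2, 1, 0, 2], [0.1, 0.9, 1.8, 1.2, 0.0, 2.0])

# Rounded African-ancestry averages from the abstract: array 0.83, low-pass 0.89 to 0.95.
gain_low = relative_gain(0.89, 0.83)
gain_high = relative_gain(0.95, 0.83)
```

Under that assumption the rounded figures give gains of roughly 7% and 14%, close to the 7 – 15% range quoted; the small difference at the upper end may simply reflect rounding of the published r2 values.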

