PLIGHT: A tool to assess privacy risk by inferring identifying characteristics from sparse, noisy genotypes

2021 ◽  
Author(s):  
Prashant Siva Emani ◽  
Gamze Gursoy ◽  
Andrew David Miranker ◽  
Mark Gerstein

The leakage of identifying information in genetic and omics data has been established in many studies, with single nucleotide polymorphisms (SNPs) shown to carry a strong risk of reidentification for individuals and their genetic relatives. While the ability of thousands or hundreds of thousands of SNPs (especially rare ones) to identify individuals has been demonstrated, here we sought to measure the informativeness of even a sparse set of tens of noisy, common SNPs from an individual, by putting the genotype-based privacy leakage from an individual on quantitative footing. We present a computational tool, PLIGHT ("Privacy Leakage by Inference across Genotypic HMM Trajectories"), that employs a population-genetics-based Hidden Markov Model of recombination and mutation to find piecewise matches of a sparse query set of SNPs to a reference genotype panel. Given the ready availability of auxiliary sources of noisy genotype data -- such as acquiring small samples of environmental DNA or learning about someone's Mendelian diseases and physical characteristics -- inference on sparse data becomes a genuine concern. We explore cases where query individuals are either known to be in databases or not, and consider both simulated "mosaics" of genotypes (i.e. genotypes stitched together from diploid segments sampled from two or more source individuals) and actual genotype data obtained from swabs of coffee cups used by a known individual. Our findings are as follows: (1) Even 10 common SNPs (minor allele frequency > 0.05) often are sufficient to identify individuals in conventional genomic databases. (2) We are able to identify first-order relatives (parents, children and siblings) of query individuals with 20-30 common SNPs. (3) We find some potential for leakage of phenotypic information, based on a simulated attack by combining polygenic risk scores (PRSs) of the piecewise genotypic matches. 
We also found, for simulated mosaics of two individuals, that 20 common SNPs were often sufficient to find the correct identities of both component individuals. Finally, applying PLIGHT to coffee-cup-derived SNPs, we find that our tool is able to identify the individual (when present in the reference database) using as few as 30 SNPs; alternatively, when the individual is not present in the reference database, we reconstruct possible genomes for the individual based on just 30-90 query SNPs by piecewise matching to the reference haplotype database. In this way, we are able to perform a small degree of imputation of unobserved query SNPs. Overall, the tool could be used to determine the value of selectively masking released SNPs, in a way that is agnostic to any explicit assumptions about underlying population membership or allele frequencies.
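The piecewise matching described in this abstract can be illustrated with a heavily simplified sketch. It assumes a haploid copying model in which each query SNP is copied from one reference haplotype, with a fixed switch probability standing in for recombination and a fixed mismatch probability standing in for mutation and genotyping noise. The function name `viterbi_mosaic`, the parameters `recomb` and `mut`, and the toy panel are all hypothetical; this is not PLIGHT's actual code, which the abstract describes as working on genotypes.

```python
import math

def viterbi_mosaic(panel, query, recomb=0.05, mut=0.01):
    """Return the most likely reference haplotype copied at each query SNP.

    panel  : list of haplotypes, each a list of 0/1 alleles over all sites
    query  : list of (site_index, allele) pairs, sorted by site_index
    recomb : assumed probability of switching haplotype between query SNPs
    mut    : assumed probability that an observed allele mismatches its source
    """
    n = len(panel)
    log_stay = math.log((1 - recomb) + recomb / n)
    log_switch = math.log(recomb / n)

    def log_emit(h, site, allele):
        return math.log(1 - mut) if panel[h][site] == allele else math.log(mut)

    # Uniform prior over reference haplotypes at the first query SNP.
    site0, allele0 = query[0]
    score = [-math.log(n) + log_emit(h, site0, allele0) for h in range(n)]
    back = []
    for site, allele in query[1:]:
        # Every switch has the same cost, so only the best previous state matters.
        best_prev = max(range(n), key=lambda h: score[h])
        new_score, pointers = [], []
        for h in range(n):
            stay = score[h] + log_stay
            switch = score[best_prev] + log_switch
            if stay >= switch:
                new_score.append(stay + log_emit(h, site, allele))
                pointers.append(h)
            else:
                new_score.append(switch + log_emit(h, site, allele))
                pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy panel: the query is a mosaic of haplotype 1 (left) and haplotype 2 (right).
panel = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
]
query = [(0, 1), (2, 1), (5, 1), (7, 1)]  # sparse set of observed SNPs
print(viterbi_mosaic(panel, query))
```

Because all switches are equally costly in this toy model, each step only needs to compare staying on the current haplotype against jumping from the single best previous one, which keeps the recursion linear in panel size.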

F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 2147
Author(s):  
Thomas R. Wood ◽  
Nathan Owens

Background: While the academic genetic literature has clearly shown that common genetic single nucleotide polymorphisms (SNPs), and even large polygenic SNP risk scores, cannot reliably be used to determine risk of disease or to personalize interventions, a significant industry of companies providing SNP-based recommendations still exists. Healthcare practitioners must therefore be able to navigate between the promise and reality of these tools, including being able to interpret the literature that is associated with a given risk or suggested intervention. One significant hurdle to this process is the fact that most population studies of common SNPs only provide average (+/- error) phenotypic or risk descriptions for a given genotype, which hides the true heterogeneity of the population and reduces the ability of an individual to determine how they themselves or their patients might truly be affected. Methods: We generated synthetic datasets from descriptive phenotypic data published on common SNPs associated with obesity, elevated fasting blood glucose, and methylation status. Using simple statistical theory and full graphical representation of the generated data, we developed a method by which anybody can better understand phenotypic heterogeneity in a population, as well as the degree to which common SNPs truly drive disease risk. Results: Individual risk SNPs had a <10% likelihood of affecting the associated phenotype (bodyweight, fasting glucose, or homocysteine levels). Example polygenic risk scores including the SNPs most associated with obesity and type 2 diabetes only explained 2% and 5% of the final phenotype, respectively. Conclusions: The data suggest that most disease risk is dominated by the effect of the modern environment, providing further evidence to support the pursuit of lifestyle-based interventions that are likely to be beneficial regardless of genetics.
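The idea of rebuilding individual-level data from published per-genotype summaries can be sketched as follows. This sketch assumes normally distributed phenotypes within each genotype group and measures the share of phenotypic variance attributable to genotype as the between-group sum of squares over the total. The genotype labels, means, standard deviations and group sizes are invented for illustration and are not taken from the paper; the function names are likewise hypothetical.

```python
import random

def simulate_groups(summary, seed=0):
    """Draw synthetic phenotypes from per-genotype (mean, sd, n) summaries."""
    rng = random.Random(seed)
    return {
        genotype: [rng.gauss(mean, sd) for _ in range(n)]
        for genotype, (mean, sd, n) in summary.items()
    }

def variance_explained(groups):
    """Fraction of total phenotypic variance lying between genotype groups."""
    values = [v for vals in groups.values() for v in vals]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    return ss_between / ss_total

# Hypothetical body-weight summaries (kg) for three genotypes at one SNP.
summary = {
    "AA": (80.0, 15.0, 400),
    "AG": (81.0, 15.0, 450),
    "GG": (82.0, 15.0, 150),
}
groups = simulate_groups(summary, seed=1)
print(f"variance explained by genotype: {variance_explained(groups):.4f}")
```

When the group means differ by far less than the within-group spread, as in these made-up numbers, the distributions overlap almost entirely and genotype accounts for only a small fraction of the variance.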


2021 ◽  
Author(s):  
Karina Bienfait ◽  
Aparna Chhibber ◽  
Jean-Claude Marshall ◽  
Martin Armstrong ◽  
Charles Cox ◽  
...  

Pharmaceutical companies have increasingly utilized genomic data for the selection of drug targets and the development of precision medicine approaches. Most major pharmaceutical companies routinely collect DNA from clinical trial participants and conduct pharmacogenomic (PGx) studies. However, the implementation of PGx studies during clinical development presents a number of challenges. These challenges include adapting to a constantly changing global regulatory environment, challenges in study design and clinical implementation, and the increasing concerns over patient privacy. Advances in the field of genomics are also providing new opportunities for pharmaceutical companies, including the availability of large genomic databases linked to patient health information, the growing use of polygenic risk scores, and the direct sequencing of clinical trial participants. The Industry Pharmacogenomics Working Group (I-PWG) is an association of pharmaceutical companies actively working in the field of pharmacogenomics. This I-PWG perspective will provide an overview of the steps pharmaceutical companies are taking to address each of these challenges, and the approaches being taken to capitalize on emerging scientific opportunities.


2021 ◽  
Vol 4 ◽  
Author(s):  
Mélissa Jaquier ◽  
Camille Albouy ◽  
Wilhelmine Bach ◽  
Conor Waldock ◽  
Virginie Marques ◽  
...  

Islands have traditionally served as model systems to study ecological and evolutionary processes (Warren et al. 2015) and could also represent a relevant system to study environmental DNA (eDNA). Isolated island reefs that are affected by climatic threats would particularly benefit from cost- and time-efficient biodiversity surveys to set priorities for their conservation. Among time-efficient methods, eDNA has emerged as a novel molecular metabarcoding technique to detect biodiversity from simple environmental samples even in remote marine environments. However, eDNA monitoring techniques for marine environments are at a developmental phase, with a few remaining unknowns related to DNA residence time and movement. In particular, the redistribution of eDNA via ocean currents could blur the composition signal and its association with local environmental conditions (Goldberg et al. 2016). Here, we investigated the detection variation of eDNA along a distance gradient across four islands in the French Scattered Islands. We collected 30 L of surface water per filter at increasing distances from the islands' reefs (0 m, 250 m, 500 m, 750 m). Using a metabarcoding protocol, we used the teleo primers to target a fraction of 12S mitochondrial DNA to detect Actinopterygii and Elasmobranchii. We then applied a sequence clustering approach to generate Molecular Taxonomic Units (MOTUs), which were assigned to a taxonomic group using a reference database. By assigning eDNA sequences to species using a public reference database, we classified species according to their preferred habitat type, either benthic/demersal or pelagic. Our results show no significant relationship between distance and MOTU richness for either habitat type. By using a Joint Species Distribution Modelling approach (JSDM, Hierarchical Modelling of Species Communities), we retained the multidimensional information captured by eDNA and detected species- and family-specific responses to distance (Fig. 1). 
We showed that benthic MOTUs were found in closer proximity to the reef, while typical pelagic MOTUs were found at greater distances from the reef. Hence, MOTU-level analyses coupled with JSDM were more informative than aggregating MOTUs into coarser richness measures. Altogether, our eDNA distance sampling gradient detected an ecological signal of habitat selection by fish species, which suggests that eDNA could help understand the behavior of species and their distribution in marine environments at a fine spatial scale.


Author(s):  
Nicole Foster ◽  
Kor-jent Dijk ◽  
Ed Biffin ◽  
Jennifer Young ◽  
Vicki Thomson ◽  
...  

A proliferation in environmental DNA (eDNA) research has increased the reliance on reference sequence databases to assign unknown DNA sequences to known taxa. Without comprehensive reference databases, DNA extracted from environmental samples cannot be correctly assigned to taxa, limiting the use of this genetic information to identify organisms in unknown sample mixtures. For animals, standard metabarcoding practices involve amplification of the mitochondrial Cytochrome-c oxidase subunit 1 (CO1) region, which is a universally amplifiable region across the majority of animal taxa. This region, however, does not work well as a DNA barcode for plants and fungi, and there is no similar universal single barcode locus that has the same species resolution. Therefore, generating reference sequences has been more difficult, and several loci have been suggested for use in parallel to reach species-level identification. For this reason, we developed a multi-gene targeted capture approach to generate reference DNA sequences for plant taxa across 20 target chloroplast gene regions in a single assay. We successfully compiled a reference database for 93 temperate coastal plants including seagrasses, mangroves, and saltmarshes/samphires. We demonstrate the importance of a comprehensive reference database to prevent species going undetected in eDNA studies. We also investigate how using multiple chloroplast gene regions impacts the ability to discriminate between taxa.


2021 ◽  
Vol 9 (1) ◽  
pp. e002287
Author(s):  
Qiulun Zhou ◽  
Ying Wang ◽  
Yuqin Gu ◽  
Jing Li ◽  
Hui Wang ◽  
...  

Introduction: To investigate associations between genetic variants related to beta-cell (BC) dysfunction or insulin resistance (IR) in type 2 diabetes (T2D) and bile acids (BAs), as well as the risk of gestational diabetes mellitus (GDM). Research design and methods: We organized a case-control study of 230 women with GDM and 217 without GDM nested in a large prospective cohort of 22 302 Chinese women in Tianjin, China. Two weighted genetic risk scores (GRSs), namely BC-GRS and IR-GRS, were established by combining 39 and 23 single nucleotide polymorphisms known to be associated with BC dysfunction and IR, respectively. Regression and mediation analyses were performed to evaluate the relationship of GRSs with BAs and GDM. Results: We found that the BC-GRS was inversely associated with taurodeoxycholic acid (TDCA) after adjustment for confounders (Beta (SE)=−0.177 (0.048); p=2.66×10⁻⁴). The BC-GRS was also associated with the risk of GDM (OR (95% CI): 1.40 (1.10 to 1.77); p=0.005), but not mediated by TDCA. Compared with individuals in the low tertile of BC-GRS, the OR for GDM was 2.25 (95% CI 1.26 to 4.01) in the high tertile. An interaction effect of IR-GRS with taurochenodeoxycholic acid (TCDCA) on the risk of GDM was evidenced (p=0.005). Women with high IR-GRS and low concentration of TCDCA had a markedly higher OR of 14.39 (95% CI 1.59 to 130.16; p=0.018), compared with those with low IR-GRS and high TCDCA. Conclusions: Genetic variants related to BC dysfunction and IR in T2D potentially influence BAs at early pregnancy and the development of GDM. The identification of both modifiable and non-modifiable risk factors may facilitate the identification of high-risk individuals to prevent GDM.


2021 ◽  
Author(s):  
Gert-Jan Jeunen ◽  
Tatsiana Lipinskaya ◽  
Helen Gajduchenko ◽  
Viktoriya Golovenchik ◽  
Michail Moroz ◽  
...  

Active environmental DNA (eDNA) surveillance through species-specific amplification has shown increased sensitivity in the detection of non-indigenous species (NIS) compared to traditional approaches. When many NIS are of interest, however, active surveillance decreases in cost- and time-efficiency. Passive surveillance through eDNA metabarcoding takes advantage of the complex DNA signal in environmental samples and facilitates the simultaneous detection of multiple species. While passive eDNA surveillance has previously detected NIS, comparative studies are essential to determine the ability of eDNA metabarcoding to accurately describe the range of invasion for multiple NIS versus alternative approaches. Here, we surveyed twelve sites, covering nine rivers across Belarus for NIS with three different techniques, i.e., an ichthyological, hydrobiological, and eDNA survey, whereby DNA was extracted from 500 mL surface water samples and amplified with two 16S rRNA primer assays targeting the fish and macro-invertebrate biodiversity. Nine non-indigenous fish and ten non-indigenous sediment-living macro-invertebrates were detected by traditional surveys, while seven NIS eDNA signals were picked up, including four fish, one aquatic and two sediment-living macro-invertebrates. Passive eDNA surveillance extended the range of invasion further north for two invasive fish and identified a new NIS for Belarus, the freshwater jellyfish Craspedacusta sowerbii. False-negative detections for the eDNA survey could be attributed to (i) preferential amplification of aquatic over sediment-living macro-invertebrates from surface water samples and (ii) an incomplete reference database. The evidence provided in this study recommends the implementation of both molecular-based and traditional approaches to maximize the probability of early detection of non-native organisms.


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. 10502-10502
Author(s):  
Elisha Hughes ◽  
Placede Tiemeny ◽  
Shannon Gallagher ◽  
Stephanie Meek ◽  
Charis Eng ◽  
...  

Background: Breast cancer (BC) risk is influenced by single-nucleotide polymorphisms (SNPs) with small effects that can be aggregated into polygenic risk scores (PRSs). PRSs have primarily been developed and validated for populations of European descent. To make a PRS available for all women, we developed and validated a novel global PRS (gPRS) that utilizes individual ancestral genetic composition. Methods: Ancestry-specific PRSs corresponding to 3 continental ancestries were developed from 149 SNPs (93 BC and 56 ancestry-informative): an African PRS was developed using a cohort of 31,126 self-reported African American patients referred for hereditary cancer testing; an East Asian PRS was developed based on published data from the Asia Breast Cancer Consortium; and a European PRS was developed using data from the Breast Cancer Association Consortium and 24,259 European hereditary cancer testing patients. For each patient, ancestry-informative SNPs were used to calculate the fractional ancestry attributable to each of the 3 continents. The gPRS was the sum of ancestry-specific PRSs weighted according to genetic ancestral composition. In an independent validation cohort (N = 62,707), we evaluated discrimination and calibration of gPRS, and compared performance against a previously described 86-SNP PRS for women of European ancestry. Associations of SNPs and PRSs with BC were analyzed using logistic regression adjusted for personal and family cancer history, age, and ancestry. Odds ratios (ORs) are reported per standard deviation within the corresponding patient population. P-values are reported as two-sided. Results: The gPRS was strongly associated with BC in the full validation cohort and in sub-cohorts defined by self-reported ancestry (Table). 95% (88/93) of BC SNPs had ≥1% frequency of risk alleles within each of the self-reported populations. 
Compared to the aforementioned 86-SNP PRS, the gPRS showed improved discrimination overall, and within each sub-cohort, with the exception of the Asian population where the sample size was too small to show superiority of either score. The 86-SNP PRS was calibrated for white non-Hispanic women but mis-calibrated for non-European ancestries. The gPRS was properly calibrated for all women. Conclusions: The 149-SNP gPRS is validated and calibrated for women of all ancestries. Combined with clinical and biological risk factors, this approach may offer improved risk stratification for all women, regardless of ancestry. [Table: see text]
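The gPRS construction described in the Methods, a sum of ancestry-specific PRSs weighted by each patient's fractional ancestry, can be sketched in a few lines. The sketch assumes each ancestry-specific PRS is a plain weighted sum of risk-allele dosages; the abstract does not state how scores were scaled or standardized. The SNP identifiers, per-SNP weights and ancestry fractions below are invented for illustration, and the function names are hypothetical.

```python
def ancestry_prs(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1 or 2) for one ancestry's weights."""
    return sum(w * dosages.get(snp, 0) for snp, w in weights.items())

def global_prs(dosages, weights_by_ancestry, fractions):
    """Combine ancestry-specific PRSs, weighting each by fractional ancestry."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("ancestry fractions must sum to 1")
    return sum(
        fractions[ancestry] * ancestry_prs(dosages, weights)
        for ancestry, weights in weights_by_ancestry.items()
    )

# Hypothetical per-SNP weights for the three continental ancestries.
weights_by_ancestry = {
    "african": {"snp1": 0.10, "snp2": 0.20},
    "east_asian": {"snp1": 0.30, "snp2": 0.00},
    "european": {"snp1": 0.20, "snp2": 0.10},
}
dosages = {"snp1": 2, "snp2": 1}  # one patient's risk-allele counts
fractions = {"african": 0.5, "east_asian": 0.0, "european": 0.5}
print(global_prs(dosages, weights_by_ancestry, fractions))
```

For a patient whose ancestry is entirely attributed to one continent, this reduces to that continent's ancestry-specific PRS.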


Rheumatology ◽  
2019 ◽  
Vol 59 (1) ◽  
pp. 90-98 ◽  
Author(s):  
Declan Webber ◽  
Jingjing Cao ◽  
Daniela Dominguez ◽  
Dafna D Gladman ◽  
Deborah M Levy ◽  
...  

Objective: Lupus nephritis (LN) is one of the most common and severe manifestations of systemic lupus erythematosus (SLE). Our aim was to test the association of SLE risk loci with LN risk in childhood-onset SLE (cSLE) and adult-onset SLE (aSLE). Methods: Two Toronto-based tertiary care SLE cohorts included cSLE (diagnosed <18 years) and aSLE patients (diagnosed ⩾18 years). Patients met ACR and/or SLICC SLE criteria and were genotyped on the Illumina Multi-Ethnic Global Array or Omni1-Quad arrays. We identified those with and without biopsy-confirmed LN. HLA and non-HLA additive SLE risk-weighted genetic risk scores (GRSs) were tested for association with LN risk in logistic models, stratified by cSLE/aSLE and ancestry. Stratified effect estimates were meta-analysed. Results: Of 1237 participants, 572 had cSLE (41% with LN) and 665 had aSLE (30% with LN). Increasing non-HLA GRS was significantly associated with increased LN risk [odds ratio (OR) = 1.26; 95% CI 1.09, 1.46; P = 0.0006], as was increasing HLA GRS in Europeans (OR = 1.55; 95% CI 1.07, 2.25; P = 0.03). There was a trend for stronger associations between both GRSs and LN risk in Europeans with cSLE compared with aSLE. When restricting cases to proliferative LN, the magnitude of these associations increased for both the non-HLA (OR = 1.30; 95% CI 1.10, 1.52; P = 0.002) and HLA GRS (OR = 1.99; 95% CI 1.29, 3.08; P = 0.002). Conclusion: We observed an association between known SLE risk loci and LN risk in children and adults with SLE, with the strongest effect observed among Europeans with cSLE. Future studies will include SLE-risk single nucleotide polymorphisms specific to non-European ancestral groups and validate findings in an independent cohort.


Genes ◽  
2020 ◽  
Vol 11 (7) ◽  
pp. 743
Author(s):  
Caiyong Yin ◽  
Kaiyuan Su ◽  
Ziwei He ◽  
Dian Zhai ◽  
Kejian Guo ◽  
...  

Y chromosomal short tandem repeats (Y-STRs) have been widely harnessed for forensic applications, such as pedigree source searching from public security databases and male identification from male–female mixed samples. For various populations, databases composed of Y-STR haplotypes have been built to provide investigative leads for solving difficult or cold cases. Recently, the supplementary application of Y chromosomal haplogroup-determining single-nucleotide polymorphisms (SNPs) for forensic purposes has been under heated debate. This study provides Y-STR haplotypes for 27 markers typed by the Yfiler™ Plus kit and Y-SNP haplogroups defined by 24 loci within the Y-SNP Pedigree Tagging System for Shandong Han (n = 305) and Yunnan Han (n = 565) populations. The genetic backgrounds of these two populations were explicitly characterized by the analysis of molecular variance (AMOVA) and multi-dimensional scaling (MDS) plots based on 27 Y-STRs. Then, population comparisons were conducted by observing Y-SNP allelic frequencies and Y-SNP haplogroup distribution, estimating forensic parameters, and depicting distribution spectra of Y-STR alleles in sub-haplogroups. The Y-STR variants, including null alleles, intermediate alleles, and copy number variations (CNVs), were co-listed, and a strong correlation between Y-STR allele variants (“DYS518~.2” alleles) and the Y-SNP haplogroup QR-M45 was observed. A network was reconstructed to illustrate the evolutionary pathway and to identify the ancestral mutation event. Also, a phylogenetic tree on the individual level was constructed to observe the relevance of the Y-STR haplotypes to the Y-SNP haplogroups. This study provides evidence that basic genetic backgrounds, which were revealed by both Y-STR and Y-SNP loci, would be useful for uncovering detailed population differences and, more importantly, demonstrates the contributing role of Y-SNPs in population differentiation and male pedigree discrimination.


2013 ◽  
Vol 3 (2) ◽  
pp. 13 ◽  
Author(s):  
Patricia M. Herman ◽  
Lee Sechrest

Growth curve analysis provides important informational benefits regarding intervention outcomes over time. Rarely, however, should outcome trajectories be assumed to be linear. Instead, both the shape and the slope of the growth curve can be estimated. Non-linear growth curves are usually modeled by including either higher-order time variables or orthogonal polynomial contrast codes. Each has limitations (multicollinearity with the first, a lack of coefficient interpretability with the second, and a loss of degrees of freedom with both) and neither encourages direct testing of alternative hypothesized curve shapes. Especially in studies with relatively small samples it is likely to be useful to preserve as much information as possible at the individual level. This article presents a step-by-step example of the use and testing of hypothesized curve shapes in the estimation of growth curves using hierarchical linear modeling for a small intervention study. DOI:10.2458/azu_jmmss_v3i2_herman
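The multicollinearity limitation mentioned above can be seen in a small numeric sketch. It assumes five equally spaced measurement occasions (a choice made only for illustration) and compares raw time and time-squared predictors with orthogonal polynomial codes obtained by centering and a single Gram-Schmidt step. The helper names are hypothetical, and this is not the article's own procedure, which tests hypothesized curve shapes directly.

```python
def centered(xs):
    """Subtract the mean from every value."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ca, cb = centered(a), centered(b)
    return dot(ca, cb) / (dot(ca, ca) * dot(cb, cb)) ** 0.5

def orthogonal_codes(times):
    """Linear and quadratic contrast codes that are mutually uncorrelated."""
    linear = centered(times)
    quad_raw = centered([t * t for t in times])
    # Remove from the quadratic term whatever the linear term already explains.
    coef = dot(quad_raw, linear) / dot(linear, linear)
    quadratic = [q - coef * l for q, l in zip(quad_raw, linear)]
    return linear, quadratic

times = [0, 1, 2, 3, 4]  # assumed measurement occasions
raw_corr = pearson(times, [t * t for t in times])
linear, quadratic = orthogonal_codes(times)
print(f"correlation of t with t^2: {raw_corr:.3f}")
print("linear codes:   ", linear)
print("quadratic codes:", quadratic)
```

The raw predictors are strongly correlated, whereas the orthogonal codes are uncorrelated by construction; the price is that their coefficients no longer read directly as change per unit of time, which is the interpretability limitation the abstract notes.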

