Efficient Haplotype Block Partitioning and Tag SNP Selection Algorithms under Various Constraints

2013 ◽  
Vol 2013 ◽  
pp. 1-13 ◽  
Author(s):  
Wen-Pei Chen ◽  
Che-Lun Hung ◽  
Yaw-Ling Lin

Patterns of linkage disequilibrium play a central role in genome-wide association studies aimed at identifying genetic variation responsible for common human diseases. These patterns in human chromosomes show a block-like structure, and regions of high linkage disequilibrium are called haplotype blocks. A small subset of SNPs, called tag SNPs, is sufficient to capture the haplotype patterns in each haplotype block. Previously developed algorithms completely partition a haplotype sample into blocks while attempting to minimize the number of tag SNPs. However, when resource limitations prevent genotyping all the tag SNPs, it is desirable to restrict their number. We propose two dynamic programming algorithms, incorporating many diversity evaluation functions, for haplotype block partitioning using a limited number of tag SNPs. We use the proposed algorithms to partition the chromosome 21 haplotype data. When the sample is fully partitioned into blocks by our algorithms, the 2,266 blocks and 3,260 tag SNPs are fewer than those identified by previous studies. We also demonstrate that our algorithms find the optimal solution by exploiting the nonmonotonic property of a common haplotype-evaluation function.
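The general idea of block partitioning by dynamic programming can be sketched as follows. This is a simplified illustration and not the authors' algorithms: it minimizes the number of blocks rather than the number of tag SNPs, it has no tag SNP budget, and it assumes one particular diversity criterion (a block is acceptable when at least a fraction `alpha` of the haplotypes carry a pattern seen more than once). The function names, the `alpha` default, and the toy data are all hypothetical.

```python
from collections import Counter


def common_coverage(haplotypes, start, end):
    """Fraction of haplotypes whose pattern on SNPs [start, end) occurs more than once."""
    counts = Counter(h[start:end] for h in haplotypes)
    common = sum(c for c in counts.values() if c > 1)
    return common / len(haplotypes)


def min_blocks(haplotypes, alpha=0.8):
    """Partition SNPs 0..n-1 into the fewest consecutive blocks that each
    satisfy common_coverage >= alpha (assumed criterion, see lead-in)."""
    n = len(haplotypes[0])
    inf = float("inf")
    best = [0] + [inf] * n   # best[i]: fewest blocks covering the first i SNPs
    cut = [0] * (n + 1)      # cut[i]: start of the last block in that optimum
    for end in range(1, n + 1):
        for start in range(end):
            if (best[start] + 1 < best[end]
                    and common_coverage(haplotypes, start, end) >= alpha):
                best[end] = best[start] + 1
                cut[end] = start
    if best[n] == inf:
        raise ValueError("no partition satisfies the coverage threshold")
    # Walk the cut points backwards to recover the blocks.
    blocks = []
    end = n
    while end > 0:
        blocks.append((cut[end], end))
        end = cut[end]
    return blocks[::-1]


# Toy sample: six haplotypes over four SNPs, each a string of 0/1 alleles.
sample = ["0000", "0000", "0011", "0011", "1100", "1111"]
print(min_blocks(sample, alpha=0.8))
```

On the toy sample, no single block over all four SNPs meets the threshold, so the sketch splits the SNPs into two blocks.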

2018 ◽  
Vol 63 (No. 2) ◽  
pp. 61-69 ◽  
Author(s):  
M.M.I. Salem ◽  
G. Thompson ◽  
S. Chen ◽  
A. Beja-Pereira ◽  
J. Carvalheira

The objectives of this study were to estimate linkage disequilibrium (LD) and to describe haplotype blocks and scan them for genes that may affect milk production traits in Portuguese Holstein cattle. In total, 526 animals were genotyped using the Illumina BovineSNP50 BeadChip, which contains 52 890 single nucleotide polymorphisms (SNPs). The final set of markers remaining after quality control consisted of 37 031 SNPs located on 29 autosomes. The LD parameters D' (historical recombination through allelic association) and r² (squared correlation coefficient between allele frequencies at two loci) were estimated, and haplotype block analyses were performed using the Haploview software. The average D' and r² values were 0.628 and 0.122, respectively. LD decreased with increasing physical distance: D' and r² fell from 0.815 and 0.283, respectively, at a distance of 0–30 kb to 0.578 and 0.090 at a distance of 401–500 kb. In total, 969 blocks were identified, comprising 4259 SNPs and covering 159.06 Mb (6.24% of the total genome) on 29 autosomes. Several genes were detected inside the haplotype blocks: the CSN1S2 gene in haplotype block 51 on BTA 6, the IL6 and B4GALT1 genes in haplotype blocks 6 and 33 on BTA 8, the IL1B and ID2 genes in haplotype blocks 19 and 29 on BTA 11, and the DGAT1 gene in haplotype block 1 on BTA 14. The extent of LD using the BovineSNP50 BeadChip did not exceed 500 kb, and its parameters r² and D' were less than 0.2 and 0.70, respectively, after 70–100 kb. Consequently, the 50K BeadChip would have a poor power in genome wide association studies at distances between adjacent markers lower than 70 kb.
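For readers unfamiliar with the two LD parameters, the sketch below computes D' and r² for a single pair of biallelic loci from phased haplotypes, using the standard textbook definitions. It is not Haploview's implementation; the function name and the toy haplotypes are hypothetical, and alleles are assumed to be coded 0/1.

```python
def ld_stats(haplotypes):
    """Return (D', r^2) for a list of (a, b) allele pairs coded 0/1."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n    # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n    # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                       # raw disequilibrium coefficient D
    # D' rescales |D| by its maximum attainable value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between the two allele indicators.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2


# Toy data: only three of the four possible haplotypes are observed.
toy = [(1, 1)] * 3 + [(1, 0)] * 1 + [(0, 0)] * 4
d_prime, r2 = ld_stats(toy)
print(d_prime, r2)
```

In the toy data one of the four possible haplotypes is absent, so D' reaches 1 while r² stays below 1, which is the usual reason the two measures are reported side by side.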


2021 ◽  
pp. 1-11
Author(s):  
Valentina Escott-Price ◽  
Karl Michael Schmidt

Background: Genome-wide association studies (GWAS) were successful in identifying SNPs showing association with disease, but their individual effect sizes are small and require large sample sizes to achieve statistical significance. Methods of post-GWAS analysis, including gene-based, gene-set and polygenic risk scores, combine the SNP effect sizes in an attempt to boost the power of the analyses. To avoid giving undue weight to SNPs in linkage disequilibrium (LD), the LD needs to be taken into account in these analyses. Objectives: We review methods that attempt to adjust the effect sizes (β-coefficients) of summary statistics, instead of simple LD pruning. Methods: We subject LD adjustment approaches to a mathematical analysis, recognising Tikhonov regularisation as a framework for comparison. Results: Observing the similarity of the processes involved with the more straightforward Tikhonov-regularised ordinary least squares estimate for multivariate regression coefficients, we note that current methods based on a Bayesian model for the effect sizes effectively provide an implicit choice of the regularisation parameter, which is convenient, but at the price of reduced transparency and, especially in smaller LD blocks, a risk of incomplete LD correction. Conclusions: There is no simple answer to the question which method is best, but where interpretability of the LD adjustment is essential, as in research aiming at identifying the genomic aetiology of disorders, our study suggests that a more direct choice of mild regularisation in the correction of effect sizes may be preferable.
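As a reminder of what the Tikhonov-regularised least squares estimate looks like in this setting, the display below gives the standard ridge form and its summary-statistic counterpart. The notation is an assumption made for illustration and is not taken from the paper: X is an n × m genotype matrix with columns standardised to mean 0 and variance 1, y is the centred phenotype, R = XᵀX / n is the LD correlation matrix, β̂ = Xᵀy / n is the vector of marginal effect estimates, and λ ≥ 0 is the regularisation parameter.

```latex
\hat{b}_{\lambda}
  = \left(X^{\top}X + \lambda I\right)^{-1} X^{\top} y
  = \left(R + \tfrac{\lambda}{n}\, I\right)^{-1} \hat{\beta}
```

Under these assumptions, λ = 0 corresponds to full LD correction when R is invertible, and larger values of λ shrink the correction towards the unadjusted marginal effects.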


2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 243-244
Author(s):  
Brittany N Diehl ◽  
Andres A Pech-Cervantes ◽  
Thomas H Terrill ◽  
Ibukun M Ogunade ◽  
Owen Rae ◽  
...  

Florida Native sheep is an indigenous breed from Florida that expresses superior parasite resistance. Previous candidate and genome-wide association studies with Florida Native sheep have identified single nucleotide polymorphisms with additive and non-additive effects associated with parasite resistance. However, the role of other potential DNA variants, such as copy number variants (CNVs), in controlling this complex trait has not been evaluated. The objective of the present study was to investigate the importance of CNVs for resistance to natural Haemonchus contortus infections in Florida Native sheep. A total of 200 sheep were evaluated. Phenotypic records included fecal egg count (FEC, eggs/gram), FAMACHA score, and packed cell volume (PCV, %). Sheep were genotyped using the GGP Ovine 50K SNP chip. The copy number analysis was used to identify CNVs using the univariate method. A total of 170 animals with CNVs and phenotypic data were used for the association testing. Association tests were carried out using single linear regression and Principal Component Analysis (PCA) correction to identify CNVs associated with FEC, FAMACHA, and PCV. To confirm our results, a second round of association testing was performed using the correlation-trend test with PCA correction. CNVs were considered significant when their adjusted p-value was < 0.05 after FDR correction. A deletion CNV on chromosome 21 was associated with FEC. This DNA variant was located in intron 2 of the RAB3IL gene and overlapped a QTL associated with changes in eosinophil number. Our study demonstrated for the first time that CNVs could potentially be involved in parasite resistance in this heritage sheep breed.
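The abstract does not say which FDR procedure was used. Assuming the common Benjamini-Hochberg step-up method, adjusted p-values can be computed as in the sketch below; the function name and the example p-values are hypothetical and are not data from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four CNVs.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)
```

With these made-up inputs, only the first variant stays below the 0.05 threshold after adjustment, even though three raw p-values were below 0.05.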


2021 ◽  
Vol 12 ◽  
Author(s):  
Yuquan Wang ◽  
Tingting Li ◽  
Liwan Fu ◽  
Siqian Yang ◽  
Yue-Qing Hu

Mendelian randomization uses genetic variants as instrumental variables to eliminate the influence of unknown confounders on causal estimation in epidemiological studies. However, with the soaring number of genetic variants identified in genome-wide association studies, pleiotropy and linkage disequilibrium among genetic variants are unavoidable and may produce severe bias in causal inference. In this study, by modeling the pleiotropic effect as a normally distributed random effect, we propose PLDMR (pleiotropy and linkage disequilibrium adaptive Mendelian randomization), a novel method based on a mixed-effects regression model, which takes linkage disequilibrium into account and also corrects for the pleiotropic effect in causal effect estimation and statistical inference. We conduct extensive simulation studies to evaluate the performance of the proposed and existing methods. Simulation results illustrate the validity and advantage of the novel method compared with other methods, especially in the presence of linkage disequilibrium and directional pleiotropic effects. In addition, by applying this novel method to data from the Atherosclerosis Risk in Communities Study, we conclude that body mass index has a significant causal effect on systolic blood pressure and thus might be a potential risk factor for it. The novel method is implemented in R and the corresponding R code is provided for free download.


2020 ◽  
Vol 18 (3) ◽  
pp. 111-119
Author(s):  
Yinghu Zhang ◽  
Haiye Luan ◽  
Hui Zang ◽  
Hongyan Yang ◽  
Xiao Xu ◽  
...  

Starch content is an important trait in barley. To evaluate the genetic diversity and identify molecular markers of starch content in barley, 40 cultivated barley genotypes collected from different regions, including genotypes whose starch content is at either the high or low end of the spectrum (15), were used in this study. All the genotypes were re-sequenced by the double-digest restriction-associated DNA sequencing method, and a total of 299,103 single-nucleotide polymorphism (SNP) markers were obtained. The genotypes were divided into four sub-populations based on FASTSTRUCTURE, principal component analysis and neighbour-joining tree analysis. All four sub-populations had a high linkage disequilibrium, especially group 3, whose members were recently bred for malting in the Jiangsu coastal area. The starch content of the barley lines was evaluated during three growing seasons (2014–2017), and the average values of starch content across the three growing seasons at the low and high ends were 51.5 and 55.0%, respectively. The starch content was affected by population structure: the barley in group 2 had a low starch content, while the barley in group 4 had a high starch content. Twenty-six SNP markers were identified as being significantly associated with starch content (P ⩽ 0.001) based on the average values across the three growing seasons using the mixed linear model method. These SNP markers were located on chromosomes 1H and 4H, and were considered loci of qSC1-1 and qSC4-1, respectively. The major identified QTLs for starch content are helpful for further research on carbohydrates and for barley breeding.


Agronomy ◽  
2020 ◽  
Vol 10 (12) ◽  
pp. 2006
Author(s):  
David P. Horvath ◽  
Michael Stamm ◽  
Zahirul I. Talukder ◽  
Jason Fiedler ◽  
Aidan P. Horvath ◽  
...  

A diverse population (429 members) of canola (Brassica napus L.) consisting primarily of winter biotypes was assembled and used in genome-wide association studies. Genotype by sequencing analysis of the population identified and mapped 290,972 high-quality markers, with missing markers per line ranging from 18.5 to 82.4% and averaging 36.8%. After interpolation, 251,575 high-quality markers remained. After filtering out markers with low minor allele counts (keeping those with count > 5), we were left with 190,375 markers. The average distance between these markers is 4463 bases with a median of 69 and a range from 1 to 281,248 bases. The heterozygosity among the imputed population ranges from 0.9 to 11.0% with an average of 5.4%. The filtered and imputed dataset was used to determine population structure and kinship, which indicated that the population had minimal structure with the best K value of 2–3. These results also indicated that the majority of the population has substantial sequence from a single population with sub-clusters of, and admixtures with, a very small number of other populations. Chromosomal linkage disequilibrium decay ranged from ~7 kb for chromosome A01 to ~68 kb for chromosome C01. Local linkage decay rates determined for all 500 kb windows with a 10 kb sliding step showed a wide range of linkage disequilibrium decay rates, indicating numerous crossover hotspots within this population, and provide a resource for determining the likely limits of linkage disequilibrium from any given marker in which to identify candidate genes. This population and the resources provided here should serve as helpful tools for investigating genetics in winter canola.


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Qing Cheng ◽  
Yi Yang ◽  
Xingjie Shi ◽  
Kar-Fu Yeung ◽  
Can Yang ◽  
...  

The proliferation of genome-wide association studies (GWAS) has prompted the use of two-sample Mendelian randomization (MR) with genetic variants as instrumental variables (IVs) for drawing reliable causal relationships between health risk factors and disease outcomes. However, the unique features of GWAS demand that MR methods account for both linkage disequilibrium (LD) and ubiquitously existing horizontal pleiotropy among complex traits, which is the phenomenon wherein a variant affects the outcome through mechanisms other than exclusively through the exposure. Therefore, statistical methods that fail to consider LD and horizontal pleiotropy can lead to biased estimates and false-positive causal relationships. To overcome these limitations, we proposed MR-LDP, a probabilistic model for MR analysis that identifies the causal effects between risk factors and disease outcomes using GWAS summary statistics in the presence of LD while properly accounting for horizontal pleiotropy among genetic variants, and we developed a computationally efficient algorithm to make the causal inference. We then conducted comprehensive simulation studies to demonstrate the advantages of MR-LDP over the existing methods. Moreover, we used two real exposure–outcome pairs to validate the results from MR-LDP compared with alternative methods, showing that our method is more efficient in using all-instrumental variants in LD. By further applying MR-LDP to lipid traits and body mass index (BMI) as risk factors for complex diseases, we identified multiple pairs of significant causal relationships, including a protective effect of high-density lipoprotein cholesterol on peripheral vascular disease and a positive causal effect of BMI on hemorrhoids.


2017 ◽  
Vol 7 (1) ◽  
Author(s):  
Jiamei Liu ◽  
Cheng Xu ◽  
Weifeng Yang ◽  
Yayun Shu ◽  
Weiwei Zheng ◽  
...  

Binary classification is widely employed to support decisions on various biomedical big data questions, such as clinical drug trials comparing treated participants and controls, and genome-wide association studies (GWASs) comparing participants with and without a phenotype. A machine learning model is trained for this purpose by optimizing its power to discriminate samples from the two groups. However, most classification algorithms tend to generate one locally optimal solution according to the input dataset and the mathematical presumptions about the dataset. Here we demonstrate, from the aspects of both disease classification and feature selection, that multiple different solutions may have similar classification performances. The existing machine learning algorithms may therefore have ignored a whole school of fish by catching only one good one. Since most existing machine learning algorithms generate a solution by optimizing a mathematical goal, understanding the biological mechanisms behind the investigated classification question may require considering both the generated solution and the ignored ones.

