An enrichment method for mapping ambiguous reads to the reference genome for NGS analysis

2019 ◽  
Vol 17 (06) ◽  
pp. 1940012
Author(s):  
Yuan Liu ◽  
Yongchao Ma ◽  
Evan Salsman ◽  
Frank A. Manthey ◽  
Elias M. Elias ◽  
...  

Mapping short reads to a reference genome is an essential step in many next-generation sequencing (NGS) analyses. In plants with large genomes, a large fraction of the reads can align to multiple locations of the genome with equally good alignment scores. How to map these ambiguous reads to the genome is a challenging problem with big impacts on the downstream analysis. Traditionally, the default method is to assign an ambiguous read randomly to one of the many potential locations. In this study, we explore two alternative methods that are based on the hypothesis that the possibility of an ambiguous read being generated by a location is proportional to the total number of reads produced by that location: (1) the enrichment method that assigns an ambiguous read to the location that has produced the most reads among all the potential locations, (2) the probability method that assigns an ambiguous read to a location based on a probability proportional to the number of reads the location produces. We systematically compared the performance of the proposed methods with that of the default random method. Our results showed that the enrichment method produced better results than the default random method and the probability method in the discovery of single nucleotide polymorphisms (SNPs). Not only did it produce more SNP markers, but it also produced SNP markers with better quality, which was demonstrated using multiple mainstay genomic analyses, including genome-wide association studies (GWAS), minor allele distribution, population structure, and genomic prediction.
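The three assignment strategies compared in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `assign_ambiguous_read`, the location labels, the idea that per-location counts come from reads already placed (for example, uniquely mapped reads), and the random tie-breaking rule are all assumptions, since the abstract does not specify them.

```python
import random


def assign_ambiguous_read(candidates, read_counts, method="enrichment", rng=None):
    """Pick one location for a read that aligns equally well to several.

    candidates  : list of candidate location identifiers (hypothetical labels)
    read_counts : dict mapping location -> number of reads already assigned there
                  (assumption: counts are taken from previously placed reads)
    method      : "random", "enrichment", or "probability"
    """
    if not candidates:
        raise ValueError("a read needs at least one candidate location")
    rng = rng or random.Random()

    if method == "random":
        # Default method: every candidate location is equally likely.
        return rng.choice(candidates)

    counts = [read_counts.get(loc, 0) for loc in candidates]

    if method == "enrichment":
        # Choose the location that has produced the most reads.
        # Ties are broken at random (an assumption, not stated in the abstract).
        best = max(counts)
        tied = [loc for loc, c in zip(candidates, counts) if c == best]
        return rng.choice(tied)

    if method == "probability":
        # Choose a location with probability proportional to its read count.
        if sum(counts) == 0:
            # No evidence at any location: fall back to a uniform choice.
            return rng.choice(candidates)
        return rng.choices(candidates, weights=counts, k=1)[0]

    raise ValueError(f"unknown method: {method}")
```

With counts of 5, 20, and 1 at three candidate locations, the enrichment method always returns the second location, while the probability method returns it in roughly 20 of every 26 draws.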

10.29007/kw3c ◽  
2019 ◽  
Author(s):  
Yuan Liu ◽  
Yongchao Ma ◽  
Evan Salsman ◽  
Frank Manthey ◽  
Elias Elias ◽  
...  

Mapping short reads to a reference genome is an essential step in many next-generation sequencing (NGS) analyses. In plants with large genomes, a large fraction of the reads can align to multiple locations of the genome with equally good alignment scores. How to map these ambiguous reads to the genome is a challenging problem with big impacts on the downstream analysis. Traditionally, the default method is to assign an ambiguous read randomly to one of the many potential locations. In this study, we explore an enrichment method that assigns an ambiguous read to the location that has produced the most reads among all the potential locations. Our results show that the enrichment method produced better results than the default random method in the discovery of single nucleotide polymorphisms (SNPs). Not only did it produce more SNP markers, but it also produced markers with better quality, which was demonstrated by higher trait-marker correlation in genome-wide association studies (GWAS).


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Javed Akhatar ◽  
Anna Goyal ◽  
Navneet Kaur ◽  
Chhaya Atri ◽  
Meenakshi Mittal ◽  
...  

Abstract Timely transition to flowering, maturity and plant height are important for agronomic adaptation and productivity of Indian mustard (B. juncea), which is a major edible oilseed crop of low input ecologies in the Indian subcontinent. Breeding manipulation for these traits is difficult because of the involvement of multiple interacting genetic and environmental factors. Here, we report a genetic analysis of these traits using a population comprising 92 diverse genotypes of mustard. These genotypes were evaluated under deficient (N75), normal (N100) or excess (N125) conditions of nitrogen (N) application. Lower N availability induced early flowering and maturity in most genotypes, while high N conditions delayed both. A genotyping-by-sequencing approach helped to identify 406,888 SNP markers and undertake genome-wide association studies (GWAS). 282 significant marker-trait associations (MTAs) were identified. We detected strong interactions between GWAS loci and nitrogen levels. Though some trait-associated SNPs were detected repeatedly across fertility gradients, the majority were identified under deficient or normal levels of N application. Annotation of the genomic region(s) within ± 50 kb of the peak SNPs facilitated prediction of 30 candidate genes belonging to light perception, circadian, floral meristem identity, flowering regulation, gibberellic acid pathways and plant development. These included over one copy each of AGL24, AP1, FVE, FRI, GID1A and GNC. FLC and CO were predicted on chromosomes A02 and B08 respectively. CDF1, CO, FLC, AGL24, GNC and FAF2 appeared to influence the variation for plant height. Our findings may help in improving phenotypic plasticity of mustard across fertility gradients through marker-assisted breeding strategies.


2021 ◽  
Vol 12 ◽  
Author(s):  
Valeria Orrù ◽  
Maristella Steri ◽  
Francesco Cucca ◽  
Edoardo Fiorillo

In recent years, systematic genome-wide association studies of quantitative immune cell traits, represented by circulating levels of cell subtypes established by flow cytometry, have revealed numerous association signals, a large fraction of which overlap perfectly with genetic signals associated with autoimmune diseases. By identifying further overlaps with association signals influencing gene expression and cell surface protein levels, it has also been possible, in several cases, to identify causal genes and infer candidate proteins affecting immune cell traits linked to autoimmune disease risk. Overall, these results provide a more detailed picture of how genetic variation affects the human immune system and autoimmune disease risk. They also highlight druggable proteins in the pathogenesis of autoimmune diseases; predict the efficacy and side effects of existing therapies; provide new indications for use for some of them; and optimize the research and development of new, more effective and safer treatments for autoimmune diseases. Here we review the genetic-driven approach that couples systematic multi-parametric flow cytometry with high-resolution genetics and transcriptomics to identify endophenotypes of autoimmune diseases for the development of new therapies.


2021 ◽  
Author(s):  
Dev Paudel ◽  
Rocheteau Dareus ◽  
Julia Rosenwald ◽  
Maria Munoz-Amatriain ◽  
Esteban Rios

Cowpea (Vigna unguiculata [L.] Walp., diploid, 2n = 22) is a major crop used as a protein source for human consumption as well as a quality feed for livestock. It is drought and heat tolerant and has been bred to develop varieties that are resilient to changing climates. Plant adaptation to new climates and their yield are strongly affected by flowering time. Therefore, understanding the genetic basis of flowering time is critical to advance cowpea breeding. The aim of this study was to perform genome-wide association studies (GWAS) to identify marker trait associations for flowering time in cowpea using single nucleotide polymorphism (SNP) markers. A total of 367 accessions from a cowpea mini-core collection were evaluated in Ft. Collins, CO in 2019 and 2020, and 292 accessions were evaluated in Citra, FL in 2018. These accessions were genotyped using the Cowpea iSelect Consortium Array that contained 51,128 SNPs. GWAS revealed seven reliable SNPs for flowering time that explained 8-12% of the phenotypic variance. Candidate genes including FT, GI, CRY2, LSH3, UGT87A2, LIF2, and HTA9 that are associated with flowering time were identified for the significant SNP markers. Further efforts to validate these loci will help to understand their role in flowering time in cowpea, and it could facilitate the transfer of some of this knowledge to other closely related legume species.


2020 ◽  
Vol 18 (3) ◽  
pp. 111-119
Author(s):  
Yinghu Zhang ◽  
Haiye Luan ◽  
Hui Zang ◽  
Hongyan Yang ◽  
Xiao Xu ◽  
...  

Abstract Starch content is an important trait in barley. To evaluate the genetic diversity and identify molecular markers of starch content in barley, 40 cultivated barley genotypes collected from different regions, including genotypes whose starch content is at either the high or low end of the spectrum (15), were used in this study. All the genotypes were re-sequenced by the double-digest-restriction associated DNA sequencing method, and a total of 299,103 single-nucleotide polymorphism (SNP) markers were obtained. The genotypes were divided into four sub-populations based on FASTSTRUCTURE, principal component analysis and neighbour-joining tree analysis. All four sub-populations had a high linkage disequilibrium, especially group 3, whose members were recently bred for malting in the Jiangsu coastal area. The starch content of the barley lines was evaluated during three growing seasons (2014–2017), and the average values of starch content across the three growing seasons at the low and high ends were 51.5 and 55.0%, respectively. The starch content was affected by population structure: the barley in group 2 had a low starch content, while the barley in group 4 had a high starch content. Twenty-six SNP markers were identified as being significantly associated with starch content (P ⩽ 0.001) based on the average values across the three growing seasons using the mixed linear model method. These SNP markers were located on chromosomes 1H and 4H, and were considered loci of qSC1-1 and qSC4-1, respectively. The major identified QTLs for starch content are helpful for further research on carbohydrates and for barley breeding.


2020 ◽  
Vol 79 (11) ◽  
pp. 1438-1445
Author(s):  
Young-Chang Kwon ◽  
Jiwoo Lim ◽  
So-Young Bang ◽  
Eunji Ha ◽  
Mi Yeong Hwang ◽  
...  

Objective Genome-wide association studies (GWAS) in rheumatoid arthritis (RA) have discovered over 100 RA loci, explaining patient-relevant RA pathogenesis but showing a large fraction of missing heritability. As a continuous effort, we conducted GWAS in a large Korean RA case–control population. Methods We newly generated genome-wide variant data in two independent Korean cohorts comprising 4068 RA cases and 36 487 controls, followed by a whole-genome imputation and a meta-analysis of the disease association results in the two cohorts. By integrating publicly available omics data with the GWAS results, a series of bioinformatic analyses were conducted to prioritise the RA-risk genes in RA loci and to dissect biological mechanisms underlying disease associations. Results We identified six new RA-risk loci (SLAMF6, CXCL13, SWAP70, NFKBIA, ZFP36L1 and LINC00158) with pmeta < 5×10⁻⁸ and consistent disease effect sizes in the two cohorts. A total of 122 genes were prioritised from the 6 novel and 13 replicated RA loci based on physical distance, regulatory variants and chromatin interaction. Bioinformatics analyses highlighted potentially RA-relevant tissues (including immune tissues, lung and small intestine) with tissue-specific expression of RA-associated genes and suggested the immune-related gene sets (such as CD40 pathway, IL-21-mediated pathway and citrullination) and the risk-allele sharing with other diseases. Conclusion This study identified six new RA-associated loci that contributed to better understanding of the genetic aetiology and biology in RA.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Fernando P. Guerra ◽  
Haktan Suren ◽  
Jason Holliday ◽  
James H. Richards ◽  
Oliver Fiehn ◽  
...  

Abstract Background Populus trichocarpa is an important forest tree species for the generation of lignocellulosic ethanol. Understanding the genomic basis of biomass production and chemical composition of wood is fundamental in supporting genetic improvement programs. Considerable variation has been observed in this species for complex traits related to growth, phenology, ecophysiology and wood chemistry. Those traits are influenced by both polygenic control and environmental effects, and their genome architecture and regulation are only partially understood. Genome-wide association studies (GWAS) represent an approach to advance that aim using thousands of single nucleotide polymorphisms (SNPs). Genotyping using exome capture methodologies represents an efficient approach to identify specific functional regions of genomes underlying phenotypic variation. Results We identified 813 K SNPs, which were utilized for genotyping 461 P. trichocarpa clones, representing 101 provenances collected from Oregon and Washington, and established in California. A GWAS performed on 20 traits, considering single SNP-marker tests, identified a variable number of significant SNPs (p-value < 6.1479E-8) in association with diameter, height, leaf carbon and nitrogen contents, and δ¹⁵N. The number of significant SNPs ranged from 2 to 220 per trait. Additionally, multiple-marker analyses by sliding-window tests detected between 6 and 192 significant windows for the analyzed traits. The significant SNPs resided within genes that encode proteins belonging to different functional classes such as protein synthesis, energy/metabolism and DNA/RNA metabolism, among others. Conclusions SNP markers within genes associated with traits of importance for biomass production were detected. They contribute to characterizing the genomic architecture of P. trichocarpa biomass required to support the development and application of marker breeding technologies.
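The significance threshold quoted in this abstract (6.1479E-8) is numerically consistent with a Bonferroni correction at α = 0.05 over roughly 813 K tests. The abstract does not say how the threshold was derived, so the sketch below is an assumption about the arithmetic, and the exact SNP count of 813,286 is back-calculated for illustration only.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Hypothetical SNP count chosen to be consistent with the reported "813 K SNPs".
threshold = bonferroni_threshold(0.05, 813_286)
print(f"{threshold:.4e}")
```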


2019 ◽  
Vol 35 (22) ◽  
pp. 4724-4729 ◽  
Author(s):  
Wujuan Zhong ◽  
Cassandra N Spracklen ◽  
Karen L Mohlke ◽  
Xiaojing Zheng ◽  
Jason Fine ◽  
...  

Abstract Summary Tens of thousands of reproducibly identified GWAS (Genome-Wide Association Studies) variants, with the vast majority falling in non-coding regions resulting in no eventual protein products, call urgently for mechanistic interpretations. Although numerous methods exist, there are few, if any, methods for simultaneously testing the mediation effects of multiple correlated SNPs via some mediator (e.g. the expression of a gene in the neighborhood) on phenotypic outcome. We propose the multi-SNP mediation intersection-union test (SMUT) to fill this methodological gap. Our extensive simulations demonstrate the validity of SMUT as well as substantial power gains, up to 92%, over alternative methods. In addition, SMUT confirmed known mediators in a real dataset of Finns for plasma adiponectin level, which were missed by many alternative methods. We believe SMUT will become a useful tool to generate mechanistic hypotheses underlying GWAS variants, facilitating functional follow-up. Availability and implementation The R package SMUT is publicly available from CRAN at https://CRAN.R-project.org/package=SMUT. Supplementary information Supplementary data are available at Bioinformatics online.
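The general idea of an intersection-union test for mediation can be sketched as follows. This is a generic illustration of the principle, not the SMUT package's procedure: a mediation effect requires both a SNP-to-mediator effect and a mediator-to-outcome effect, so the combined null hypothesis is rejected only when both component tests reject, which corresponds to taking the larger of the two p-values. The function and argument names are hypothetical, and how each component p-value is obtained is left out.

```python
def intersection_union_pvalue(p_snps_to_mediator, p_mediator_to_outcome):
    """Combine two component p-values with an intersection-union test.

    The composite null (no mediation) holds if EITHER component effect is zero,
    so the overall p-value is the maximum of the component p-values.
    """
    for p in (p_snps_to_mediator, p_mediator_to_outcome):
        if not 0.0 <= p <= 1.0:
            raise ValueError("p-values must lie in [0, 1]")
    return max(p_snps_to_mediator, p_mediator_to_outcome)


# Strong evidence on one path does not help if the other path is weak.
print(intersection_union_pvalue(1e-6, 0.03))
```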


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Qing Cheng ◽  
Yi Yang ◽  
Xingjie Shi ◽  
Kar-Fu Yeung ◽  
Can Yang ◽  
...  

Abstract The proliferation of genome-wide association studies (GWAS) has prompted the use of two-sample Mendelian randomization (MR) with genetic variants as instrumental variables (IVs) for drawing reliable causal relationships between health risk factors and disease outcomes. However, the unique features of GWAS demand that MR methods account for both linkage disequilibrium (LD) and ubiquitously existing horizontal pleiotropy among complex traits, which is the phenomenon wherein a variant affects the outcome through mechanisms other than exclusively through the exposure. Therefore, statistical methods that fail to consider LD and horizontal pleiotropy can lead to biased estimates and false-positive causal relationships. To overcome these limitations, we propose a probabilistic model for MR analysis (MR-LDP) that identifies the causal effects between risk factors and disease outcomes using GWAS summary statistics in the presence of LD and properly accounts for horizontal pleiotropy among genetic variants, and we develop a computationally efficient algorithm to make the causal inference. We then conducted comprehensive simulation studies to demonstrate the advantages of MR-LDP over the existing methods. Moreover, we used two real exposure–outcome pairs to validate the results from MR-LDP compared with alternative methods, showing that our method is more efficient in using all-instrumental variants in LD. By further applying MR-LDP to lipid traits and body mass index (BMI) as risk factors for complex diseases, we identified multiple pairs of significant causal relationships, including a protective effect of high-density lipoprotein cholesterol on peripheral vascular disease and a positive causal effect of BMI on hemorrhoids.


2019 ◽  
Vol 25 (10) ◽  
pp. 2455-2467 ◽  
Author(s):  
Tim B. Bigdeli ◽  
Giulio Genovese ◽  
Penelope Georgakopoulos ◽  
Jacquelyn L. Meyers ◽  
...  

Abstract Schizophrenia is a common, chronic and debilitating neuropsychiatric syndrome affecting tens of millions of individuals worldwide. While rare genetic variants play a role in the etiology of schizophrenia, most of the currently explained liability is within common variation, suggesting that variation predating the human diaspora out of Africa harbors a large fraction of the common variant attributable heritability. However, common variant association studies in schizophrenia have concentrated mainly on cohorts of European descent. We describe genome-wide association studies of 6152 cases and 3918 controls of admixed African ancestry, and of 1234 cases and 3090 controls of Latino ancestry, representing the largest such study in these populations to date. Combining results from the samples with African ancestry with summary statistics from the Psychiatric Genomics Consortium (PGC) study of schizophrenia yielded seven newly genome-wide significant loci, and we identified an additional eight loci by incorporating the results from samples with Latino ancestry. Leveraging population differences in patterns of linkage disequilibrium, we achieve improved fine-mapping resolution at 22 previously reported and 4 newly significant loci. Polygenic risk score profiling revealed improved prediction based on trans-ancestry meta-analysis results for admixed African (Nagelkerke’s R² = 0.032; liability R² = 0.017; P < 10⁻⁵²), Latino (Nagelkerke’s R² = 0.089; liability R² = 0.021; P < 10⁻⁵⁸), and European individuals (Nagelkerke’s R² = 0.089; liability R² = 0.037; P < 10⁻¹¹³), further highlighting the advantages of incorporating data from diverse human populations.

