QTG-Finder: a machine-learning based algorithm to prioritize causal genes of quantitative trait loci

2018 ◽  
Author(s):  
Fan Lin ◽  
Jue Fan ◽  
Seung Y. Rhee

Abstract: Linkage mapping is one of the most commonly used methods to identify genetic loci that determine a trait. However, the loci identified by linkage mapping may contain hundreds of candidate genes and require a time-consuming and labor-intensive fine mapping process to find the causal gene controlling the trait. With the availability of a rich assortment of genomic and functional genomic data, it is possible to develop a computational method to facilitate faster identification of causal genes. We developed QTG-Finder, a machine learning based algorithm to prioritize causal genes by ranking genes within a quantitative trait locus (QTL). Two predictive models were trained separately based on known causal genes in Arabidopsis and rice. With an independent validation analysis, we demonstrate that the models can correctly prioritize about 65% of Arabidopsis and 60% of rice causal genes when the top 20% ranked genes were considered. The models can prioritize different types of traits, although with different efficiency. We also identified several important features of causal genes including paralog copy number, being a transporter, being a transcription factor, and containing SNPs that cause a premature stop codon. This work lays the foundation for systematically understanding characteristics of causal genes and establishes a pipeline to predict causal genes based on public data.

One sentence summary: We systematically analyzed the genomic characteristics of causal genes in QTLs and developed a novel computational tool to prioritize causal genes.

2019 ◽  
Vol 9 (10) ◽  
pp. 3129-3138 ◽  
Author(s):  
Fan Lin ◽  
Jue Fan ◽  
Seung Y. Rhee

Linkage mapping is one of the most commonly used methods to identify genetic loci that determine a trait. However, the loci identified by linkage mapping may contain hundreds of candidate genes and require a time-consuming and labor-intensive fine mapping process to find the causal gene controlling the trait. With the availability of a rich assortment of genomic and functional genomic data, it is possible to develop a computational method to facilitate faster identification of causal genes. We developed QTG-Finder, a machine learning based algorithm to prioritize causal genes by ranking genes within a quantitative trait locus (QTL). Two predictive models were trained separately based on known causal genes in Arabidopsis and rice. An independent validation analysis showed that the models could recall about 64% of Arabidopsis and 79% of rice causal genes when the top 20% ranked genes were considered. The top 20% ranked genes can range from 10 to 100 genes, depending on the size of a QTL. The models can prioritize different types of traits, although with different efficiency. We also identified several important features of causal genes including paralog copy number, being a transporter, being a transcription factor, and containing SNPs that cause a premature stop codon. This work lays the foundation for systematically understanding characteristics of causal genes and establishes a pipeline to predict causal genes based on public data.
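The "recall when the top 20% ranked genes were considered" figure can be illustrated with a minimal Python sketch. This is not the QTG-Finder code: the function names and gene IDs below are hypothetical, and the sketch assumes each QTL comes with a list of genes already ordered from most to least likely causal, plus one known causal gene.

```python
import math


def top_fraction_hit(ranked_genes, causal_gene, fraction=0.2):
    """Return True if causal_gene falls within the top `fraction` of the ranking."""
    # Size of the top slice, e.g. 10 genes for a 50-gene QTL at fraction=0.2.
    cutoff = max(1, math.ceil(len(ranked_genes) * fraction))
    return causal_gene in ranked_genes[:cutoff]


def recall_at_top_fraction(qtls, fraction=0.2):
    """Share of QTLs whose known causal gene lands in the top-ranked slice.

    `qtls` is a list of (ranked_genes, causal_gene) pairs.
    """
    hits = sum(top_fraction_hit(ranked, causal, fraction) for ranked, causal in qtls)
    return hits / len(qtls)


# Toy data with made-up gene IDs (assumption, not taken from the paper).
qtl_small = ([f"gene{i}" for i in range(50)], "gene3")      # rank 4 of 50
qtl_large = ([f"gene{i}" for i in range(500)], "gene250")   # rank 251 of 500

print(recall_at_top_fraction([qtl_small, qtl_large]))
```

With these toy inputs the top 20% slice holds 10 genes for the 50-gene QTL and 100 genes for the 500-gene QTL, which matches the 10 to 100 range mentioned in the abstract.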


2020 ◽  
Vol 10 (7) ◽  
pp. 2411-2421
Author(s):  
Fan Lin ◽  
Elena Z. Lazarus ◽  
Seung Y. Rhee

Linkage mapping has been widely used to identify quantitative trait loci (QTL) in many plants and usually requires a time-consuming and labor-intensive fine mapping process to find the causal gene underlying the QTL. Previously, we described QTG-Finder, a machine-learning algorithm to rationally prioritize candidate causal genes in QTLs. While it showed good performance, QTG-Finder could only be used in Arabidopsis and rice because of the limited number of known causal genes in other species. Here we tested the feasibility of enabling QTG-Finder to work on species that have few or no known causal genes by using orthologs of known causal genes as the training set. The model trained with orthologs could recall about 64% of Arabidopsis and 83% of rice causal genes when the top 20% ranked genes were considered, which is similar to the performance of models trained with known causal genes. The average precision was 0.027 for Arabidopsis and 0.029 for rice. We further extended the algorithm to include polymorphisms in conserved non-coding sequences and gene presence/absence variation as additional features. Using this algorithm, QTG-Finder2, we trained and cross-validated Sorghum bicolor and Setaria viridis models. The S. bicolor model was validated by causal genes curated from the literature and could recall 70% of causal genes when the top 20% ranked genes were considered. In addition, we applied the S. viridis model and public transcriptome data to prioritize a plant height QTL and identified 13 candidate genes. QTG-Finder2 can accelerate the discovery of causal genes in any plant species and facilitate agricultural trait improvement.


2020 ◽  
Author(s):  
Fan Lin ◽  
Elena Z. Lazarus ◽  
Seung Y. Rhee

Abstract: Linkage mapping has been widely used to identify quantitative trait loci (QTL) in many plants and usually requires a time-consuming and labor-intensive fine mapping process to find the causal gene underlying the QTL. Previously, we described QTG-Finder, a machine-learning algorithm to rationally prioritize candidate causal genes in QTLs. While it showed good performance, QTG-Finder could only be used in Arabidopsis and rice because of the limited number of known causal genes in other species. Here we tested the feasibility of enabling QTG-Finder to work on species that have few or no known causal genes by using orthologs of known causal genes as the training set. The model trained with orthologs could recall about 64% of Arabidopsis and 83% of rice causal genes when the top 20% ranked genes were considered, which is similar to the performance of models trained with known causal genes. We further extended the algorithm to include polymorphisms in conserved non-coding sequences and gene presence/absence variation as additional features. Using this algorithm, QTG-Finder2, we trained and cross-validated Sorghum bicolor and Setaria viridis models. The S. bicolor model was validated by causal genes curated from the literature and could recall 70% of causal genes when the top 20% ranked genes were considered. In addition, we applied the S. viridis model and public transcriptome data to prioritize a plant height QTL and identified 13 candidate genes. QTG-Finder2 can accelerate the discovery of causal genes in any plant species and facilitate agricultural trait improvement.


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Zheng Zeng ◽  
Yanzhou Wang ◽  
Chan Liu ◽  
Xiufeng Yang ◽  
Hengyun Wang ◽  
...  

Abstract: Ramie is an important natural fiber crop, and the fiber yield and its related traits are the most valuable traits in ramie production. However, the genetic basis for these traits is still poorly understood, which has dramatically hindered breeding for high yield in this fiber crop. Herein, a high-density genetic map with 6,433 markers spanning 2476.5 cM was constructed using a population derived from two parents, cultivated ramie Zhongsizhu 1 (ZSZ1) and its wild progenitor B. nivea var. tenacissima (BNT). The fiber yield (FY) and its four related traits, stem diameter (SD) and length (SL) and stem bark weight (BW) and thickness (BT), were subjected to quantitative trait locus (QTL) analysis, resulting in a total of 47 QTLs identified. Forty QTLs were mapped into 12 genomic regions, thus forming 12 QTL clusters. Among the 47 QTLs, there were 14 whose wild allele from BNT was beneficial. Interestingly, all QTLs in Cluster 10 displayed overdominance, indicating that the region of this cluster likely contains heterotic loci. In addition, four fiber yield-related genes that underwent positive selection were found either to fall into the FY-related QTL regions or to be near the identified QTLs. The dissection of FY and FY-related traits not only improved our understanding of the genetic basis of these traits, but also provided new insights into the domestication of FY in ramie. The identification of many QTLs and the discovery of beneficial alleles from wild species provided a basis for the improvement of yield traits in ramie breeding.


2020 ◽  
Vol 103 (1) ◽  
pp. 266-278
Author(s):  
Fanmiao Wang ◽  
Kenji Yano ◽  
Shiro Nagamatsu ◽  
Mayuko Inari‐Ikeda ◽  
Eriko Koketsu ◽  
...  

2002 ◽  
Vol 25 (4) ◽  
pp. 605-608 ◽  
Author(s):  
Zong Hu CUI ◽  
Kiyomitsu NEMOTO ◽  
Kohei KAWAKAMI ◽  
Tatsuo GONDA ◽  
Toru NABIKA ◽  
...  

2017 ◽  
Author(s):  
David Stacey ◽  
Eric B. Fauman ◽  
Daniel Ziemek ◽  
Benjamin B. Sun ◽  
Eric L. Harshfield ◽  
...  

Abstract: Quantitative trait locus (QTL) mapping of molecular phenotypes such as metabolites, lipids, and proteins through genome-wide association studies (GWAS) represents a powerful means of highlighting molecular mechanisms relevant to human diseases. However, a major challenge of this approach is to identify the causal gene(s) at the observed QTLs. Here we present a framework for the “Prioritisation of candidate causal Genes at Molecular QTLs” (ProGeM), which incorporates biological domain-specific annotation data alongside genome annotation data from multiple repositories. We assessed the performance of ProGeM using a reference set of 227 previously reported and extensively curated metabolite QTLs. For 98% of these loci, the expert-curated gene was one of the candidate causal genes prioritised by ProGeM. Benchmarking analyses revealed that 69% of the causal candidates were nearest to the sentinel variant at the investigated molecular QTLs, indicating that genomic proximity is the most reliable indicator of “true positive” causal genes. In contrast, cis-gene expression QTL data led to three false positive candidate causal gene assignments for every one true positive assignment. We provide evidence that these conclusions also apply to other molecular phenotypes, suggesting that ProGeM is a powerful and versatile tool for annotating molecular QTLs. ProGeM is freely available via GitHub.
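The genomic-proximity criterion described above (the gene nearest to the sentinel variant) can be sketched in a few lines of Python. This is an illustrative assumption of how such a lookup might work, not ProGeM's actual implementation; the gene names, coordinates, and function names are hypothetical, and all genes are assumed to lie on the same chromosome as the variant.

```python
def distance_to_gene(pos, start, end):
    """Distance in bp from a variant position to a gene interval (0 if inside it)."""
    if start <= pos <= end:
        return 0
    return min(abs(pos - start), abs(pos - end))


def nearest_gene(variant_pos, genes):
    """Return the name of the gene closest to variant_pos.

    `genes` maps gene name -> (start, end) on the variant's chromosome.
    """
    return min(genes, key=lambda name: distance_to_gene(variant_pos, *genes[name]))


# Made-up coordinates for illustration only.
genes = {
    "GENE_A": (1_000, 5_000),
    "GENE_B": (8_000, 12_000),
    "GENE_C": (20_000, 25_000),
}
sentinel = 7_200  # hypothetical sentinel variant position

print(nearest_gene(sentinel, genes))
```

Here the sentinel variant sits 800 bp upstream of GENE_B and 2,200 bp past GENE_A, so GENE_B is reported as the nearest candidate.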


2010 ◽  
Vol 92 (4) ◽  
pp. 273-281 ◽  
Author(s):  
J. HERNÁNDEZ-SÁNCHEZ ◽  
A. CHATZIPLI ◽  
D. BERALDI ◽  
J. GRATTEN ◽  
J. G. PILKINGTON ◽  
...  

Summary: Historical information can be used, in addition to pedigree, traits and genotypes, to map quantitative trait loci (QTL) in general populations via maximum likelihood estimation of variance components. This analysis is known as linkage disequilibrium (LD) and linkage mapping, because it exploits both linkage in families and LD at the population level. The search for QTL in the wild population of Soay sheep on St. Kilda is a proof of principle. We analysed the data from a previous study and confirmed some of the QTLs reported. The most striking result was the confirmation of a QTL affecting birth weight that had been reported using association tests but not when using linkage-based analyses.

