GA-Based Data Mining Applied to Genetic Data for the Diagnosis of Complex Diseases

Author(s):  
Vanessa Aguiar ◽  
Jose A. Seoane ◽  
Ana Freire ◽  
Ling Guo

A new algorithm is presented for finding genotype-phenotype association rules from data related to complex diseases. The algorithm is based on genetic algorithms, a technique of evolutionary computation. It was compared with several traditional data mining techniques and was shown to obtain better classification scores and to find more rules on artificially generated data. It also obtained similar results on some UCI Machine Learning datasets. In this chapter it is assumed that several groups of Single Nucleotide Polymorphisms (SNPs) have an impact on the predisposition to develop a complex disease such as schizophrenia. The authors expect to validate this on real data in the near future.
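
As a rough illustration of the general idea, and not the authors' algorithm, the sketch below evolves a single rule of the form "if SNP_i has genotype g_i and SNP_j has genotype g_j, then case" with a minimal genetic algorithm. The synthetic data generator, the rule encoding, the fitness function (plain classification accuracy), and all parameter values are assumptions made for illustration.

```python
import random

def make_data(rng, n=200, n_snps=6):
    """Hypothetical data: genotypes 0/1/2; an individual is a case iff SNP0 == 2 and SNP3 == 1."""
    rows = []
    for _ in range(n):
        g = [rng.randrange(3) for _ in range(n_snps)]
        rows.append((g, int(g[0] == 2 and g[3] == 1)))
    return rows

def matches(rule, genotype):
    # A rule has one entry per SNP: None means "don't care", otherwise the required genotype.
    return all(r is None or r == g for r, g in zip(rule, genotype))

def fitness(rule, data):
    # Fraction of individuals whose label equals the rule's prediction.
    return sum(int(matches(rule, g)) == label for g, label in data) / len(data)

def mutate(rule, rng, rate=0.2):
    # Resample each condition with probability `rate`.
    return [rng.choice([None, 0, 1, 2]) if rng.random() < rate else r for r in rule]

def crossover(a, b, rng):
    # One-point crossover between two parent rules.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(data, n_snps, rng, pop_size=30, generations=40):
    pop = [[rng.choice([None, 0, 1, 2]) for _ in range(n_snps)] for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        pop.sort(key=lambda r: fitness(r, data), reverse=True)
        history.append(fitness(pop[0], data))
        parents = pop[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - 1)
        ]
        pop = [pop[0]] + children  # elitism: the best rule survives unchanged
    return max(pop, key=lambda r: fitness(r, data)), history

rng = random.Random(0)
data = make_data(rng)
best_rule, history = evolve(data, 6, rng)
print(best_rule, fitness(best_rule, data))
```

Because the best rule is carried over unchanged, the best fitness per generation cannot decrease. A practical system would likely evolve sets of rules and use a fitness measure that is less sensitive to class imbalance than raw accuracy.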

Author(s):  
Marwa M. Abd El Hamid ◽  
Mohamed Shaheen ◽  
Mai S. Mabrouk ◽  
Yasser M. K. Omar

Alzheimer’s disease (AD) is a progressive disease that attacks the brain’s neurons and causes problems in memory, thinking, and reasoning skills. Personalized Medicine (PM) needs a better and more accurate understanding of the relationship between human genetic data and complex diseases like AD. The goal of PM is to tailor treatment to a person’s individual characteristics. PM requires the prediction of a person’s disease from genetic data, and its success depends on the accurate detection of genetic biomarkers. Single Nucleotide Polymorphisms (SNPs) are considered the most prevalent type of variation in the human genome. Epistasis has a biological relevance to complex diseases and has an important impact on PM. Detection of the most significant epistasis interactions associated with complex diseases is a big challenge. This paper reviews several machine learning techniques and algorithms to detect the most significant epistasis interactions in Alzheimer’s disease. We discuss many machine learning techniques that can be used for detecting SNP combinations, such as Random Forests, Support Vector Machines, Multifactor Dimensionality Reduction, Neural Networks, and Deep Learning. This review paper highlights the pros and cons of these techniques and explains how they can be applied in an efficient framework for knowledge discovery and data mining in AD.


2018 ◽  
Author(s):  
Walid Korani ◽  
Josh P. Clevenger ◽  
Ye Chu ◽  
Peggy Ozias-Akins

Single Nucleotide Polymorphisms (SNPs) have many advantages as molecular markers since they are ubiquitous and co-dominant. However, the discovery of true SNPs especially in polyploid species is difficult. Peanut is an allopolyploid, which has a very low rate of true SNP calling. A large set of true and false SNPs identified from the Arachis 58k Affymetrix array was leveraged to train machine learning models to select true SNPs straight from sequence data. These models achieved accuracy rates of above 80% using real peanut RNA-seq and whole genome shotgun (WGS) re-sequencing data, which is higher than previously reported for polyploids. A 48K SNP array, Axiom Arachis2, was designed using the approach which revealed 75% accuracy of calling SNPs from different tetraploid peanut genotypes. Using the method to simulate SNP variation in peanut, cotton, wheat, and strawberry, we show that models built with our parameter sets achieve above 98% accuracy in selecting true SNPs. Additionally, models built with simulated genotypes were able to select true SNPs at above 80% accuracy using real peanut data, demonstrating that our model can be used even if real data are not available to train the models. This work demonstrates an effective approach for calling highly reliable SNPs from polyploids using machine learning. A novel tool was developed for predicting true SNPs from sequence data, designated as SNP-ML (SNP-Machine Learning, pronounced “snip mill”), using the described models. SNP-ML additionally provides functionality to train new models not included in this study for customized use, designated SNP-MLer (SNP-Machine Learner, pronounced “snip miller”). SNP-ML is freely available for public use.


2021 ◽  
Author(s):  
Thomas K. F. Wong ◽  
Teng Li ◽  
Louis Ranjard ◽  
Steven Wu ◽  
Jeet Sukumaran ◽  
...  

A current strategy for obtaining haplotype information from several individuals involves short-read sequencing of pooled amplicons, where fragments from each individual are identified by a unique DNA barcode. In this paper, we report a new method to recover the phylogeny of haplotypes from short-read sequences obtained using pooled amplicons from a mixture of individuals, without barcoding. The method, AFPhyloMix, accepts an alignment of the mixture of reads against a reference sequence, obtains the single-nucleotide-polymorphism (SNP) patterns along the alignment, and constructs the phylogenetic tree according to the SNP patterns. AFPhyloMix adopts a Bayesian model of inference to estimate the phylogeny of the haplotypes and their relative frequencies, given that the number of haplotypes is known. In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and frequencies of the constituent haplotypes, for mixtures with up to 15 haplotypes. AFPhyloMix also worked well on a real data set of kangaroo mitochondrial DNA sequences.


Author(s):  
J. Hertzberg ◽  
S. Mundlos ◽  
M. Vingron ◽  
G. Gallone

The computational prediction of disease-associated genetic variation is of fundamental importance for the genomics, genetics and clinical research communities. Whereas the mechanisms and disease impact underlying coding single nucleotide polymorphisms (SNPs) and small Insertions/Deletions (InDels) have been the focus of intense study, little is known about the corresponding impact of structural variants (SVs), which are challenging to detect, phase and interpret. Few methods have been developed to prioritise larger chromosomal alterations such as Copy Number Variants (CNVs) based on their pathogenicity. We address this issue with TADA, a method to prioritise pathogenic CNVs through manual filtering and automated classification, based on an extensive catalogue of functional annotation supported by rigorous enrichment analysis. We demonstrate that our machine-learning classifiers for deletions and duplications are able to accurately predict pathogenic CNVs (AUC: 0.8042 and 0.7869, respectively) and produce a well-calibrated pathogenicity score. The combination of enrichment analysis and classification suggests that prioritisation of pathogenic CNVs based on functional annotation is a promising approach to support clinical diagnostics and to further the understanding of mechanisms that control the disease impact of larger genomic alterations.


2007 ◽  
Vol 10 (6) ◽  
pp. 871-885 ◽  
Author(s):  
An Windelinckx ◽  
Robert Vlietinck ◽  
Jeroen Aerssens ◽  
Gaston Beunen ◽  
Martine A. I. Thomis

Fine mapping of linkage peaks is one of the great challenges facing researchers who try to identify genes and genetic variants responsible for the variation in a certain trait or complex disease. Once the trait is linked to a certain chromosomal region, most studies use a candidate gene approach followed by a selection of polymorphisms within these genes, either based on their possibility to be functional, or based on the linkage disequilibrium between adjacent markers. For both candidate gene selection and SNP selection, several approaches have been described, and different software tools are available. However, mastering all these information sources and choosing between the different approaches can be difficult and time-consuming. Therefore, this article lists several of these in silico procedures, and the authors describe an empirical two-step fine mapping approach, in which candidate genes are prioritized using a bioinformatics approach (ENDEAVOUR), and the top genes are chosen for further SNP selection with a linkage disequilibrium based method (Tagger). The authors present the different actions that were applied within this approach on two previously identified linkage regions for muscle strength. This resulted in the selection of 331 polymorphisms located in 112 different candidate genes out of an initial set of 23,300 SNPs.
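
Linkage disequilibrium between two markers is commonly summarized by the squared correlation r² of their alleles. The sketch below computes r² for two biallelic SNPs from phased haplotypes coded 0/1. It is a generic illustration, not a description of how Tagger works internally, and the haplotype vectors are made up.

```python
def r_squared(hap_a, hap_b):
    """Pairwise LD (r^2) between two biallelic SNPs given phased 0/1 haplotypes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n                                # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                                # frequency of allele 1 at SNP B
    p_ab = sum(x * y for x, y in zip(hap_a, hap_b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else 0.0          # monomorphic SNP: report 0

# Hypothetical haplotypes: the first pair is perfectly correlated, the second independent.
print(r_squared([0, 1, 0, 1], [0, 1, 0, 1]))
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))
```

With a measure like this, one SNP can stand in for others that exceed a chosen r² threshold with it, which is the usual motivation for LD-based SNP selection.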


2018 ◽  
Author(s):  
Brian S. Helfer ◽  
Darrell O. Ricke

High throughput sequencing (HTS) of single nucleotide polymorphisms (SNPs) provides additional applications for DNA forensics including identification, mixture analysis, kinship prediction, and biogeographic ancestry prediction. Public repositories of human genetic data are being rapidly generated and released, but the majority of these samples are de-identified to protect privacy, and have little or no individual metadata such as appearance (photos), ethnicity, relatives, etc. A reference in silico dataset has been generated to enable development and testing of new DNA forensics algorithms. This dataset provides 11 million SNP profiles for individuals with defined ethnicities and family relationships spanning eight generations with admixture for a panel with 39,108 SNPs.


2019 ◽  
Vol 20 (12) ◽  
pp. 2962 ◽  
Author(s):  
Kumaraswamy Naidu Chitrala ◽  
Mitzi Nagarkatti ◽  
Prakash Nagarkatti ◽  
Suneetha Yeguvapalli

Breast cancer is a leading cancer type and one of the major health issues faced by women around the world. Some of its major risk factors include body mass index, hormone replacement therapy, family history and germline mutations. Of these risk factors, estrogen levels play a crucial role. Among the estrogen receptors, estrogen receptor alpha (ERα) is known to interact with tumor suppressor protein p53 directly, thereby repressing its function. Previously, we have studied the impact of deleterious breast cancer-associated non-synonymous single nucleotide polymorphisms (nsSNPs) rs11540654 (R110P), rs17849781 (P278A) and rs28934874 (P151T) in the TP53 gene on the p53 DNA-binding core domain. In the present study, we aimed to analyze the impact of these mutations on the p53–ERα interaction. To this end, we have modelled the full-length structure of human p53, validated its quality using PROCHECK, and subjected it to energy minimization using the NOMAD-Ref web server. The three-dimensional structure of the ERα activation function-2 (AF-2) domain was downloaded from the Protein Data Bank. Interactions between the modelled native and mutant (R110P, P278A, P151T) p53 and ERα were studied using ZDOCK. Machine learning predictions on the interactions were performed using Weka software. Results from the protein–protein docking showed that the atoms, residues and solvent accessibility surface area (SASA) at the interface were increased in both p53 and ERα for the R110P mutation compared to the native complexes, indicating that the R110P mutation has more impact on the p53–ERα interaction than the other two mutants. Mutations P151T and P278A, on the other hand, showed a large deviation from the native p53–ERα complex in atoms and residues at the surface. Further, results from artificial neural network analysis showed that these structural features are important for predicting the impact of these three mutations on the p53–ERα interaction. Overall, these three mutations showed a large deviation in total SASA in both p53 and ERα. In conclusion, results from our study will be crucial in making decisions for hormone-based therapies against breast cancer.


2019 ◽  
Vol 15 (4) ◽  
pp. 277-283 ◽  
Author(s):  
Junyan Li ◽  
Xiaohong Niu ◽  
JianBo Li ◽  
Qingzhong Wang

Background: Previous studies suggested that the Pro12Ala single nucleotide polymorphism located within the PPARG gene was significantly associated with T2DM. Recently, genetic studies on Pro12Ala have been conducted in different ethnic groups, and their results were inconsistent. Moreover, the systematic review has not been updated since 2000. Objective: To further validate the risk of Pro12Ala for T2DM based on the genetic data. Methods: Genetic studies on Pro12Ala in T2DM were searched in the PubMed and PMC databases from January 2000 to October 2017. The meta-analysis was conducted with the CMA software. Results: The meta-analysis collected 14 studies including 20702 cases and 36227 controls. The combined analysis of all studies found that Pro12Ala was significantly associated with T2DM and that the Ala allele was associated with increased risk of the disease. Nevertheless, publication bias was detected in the combined analysis. The subgroup analysis indicated that Pro12Ala was significant in the Caucasian and Chinese populations. There was no heterogeneity or publication bias in these two groups. Conclusion: The meta-analysis confirmed the evidence that Pro12Ala is a susceptibility variant associated with decreased risk of T2DM.
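
The abstract does not say which pooling model was used in the CMA software. As a generic illustration only, the sketch below pools study-level odds ratios with a fixed-effect inverse-variance model. The 2x2 counts are hypothetical and are not taken from the 14 studies.

```python
import math

def pooled_odds_ratio(tables):
    """Fixed-effect (inverse-variance) pooled odds ratio with a 95% confidence interval.

    Each table is (a, b, c, d): allele carriers and non-carriers among cases,
    then carriers and non-carriers among controls. All counts must be positive.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for a, b, c, d in tables:
        log_or = math.log((a * d) / (b * c))
        variance = 1 / a + 1 / b + 1 / c + 1 / d  # Woolf variance of the log odds ratio
        weight = 1 / variance
        weighted_sum += weight * log_or
        total_weight += weight
    pooled_log_or = weighted_sum / total_weight
    se = math.sqrt(1 / total_weight)
    ci = (math.exp(pooled_log_or - 1.96 * se), math.exp(pooled_log_or + 1.96 * se))
    return math.exp(pooled_log_or), ci

# Hypothetical counts for three studies.
studies = [(30, 170, 45, 155), (12, 88, 20, 80), (60, 340, 75, 325)]
or_hat, (low, high) = pooled_odds_ratio(studies)
print(f"pooled OR = {or_hat:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

A pooled odds ratio below 1 would indicate a protective allele and one above 1 a risk allele. When studies are heterogeneous, a random-effects model is usually preferred over this fixed-effect version.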


2013 ◽  
Vol 380-384 ◽  
pp. 1469-1472
Author(s):  
Gui Jun Shan

Partition methods for real data play an extremely important role in decision tree algorithms in data mining and machine learning because the decision tree algorithms require that the values of attributes are discrete. In this paper, we propose a novel partition method for real data in decision tree using statistical criterion. This method constructs a statistical criterion to find accurate merging intervals. In addition, we present a heuristic partition algorithm to achieve a desired partition result with the aim to improve the performance of decision tree algorithms. Empirical experiments on UCI real data show that the new algorithm generates a better partition scheme that improves the classification accuracy of C4.5 decision tree than existing algorithms.
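
The abstract does not specify the statistical criterion. As a stand-in, the sketch below uses the well-known ChiMerge idea: start with one interval per distinct value and repeatedly merge the adjacent pair whose class distributions are most similar according to a chi-square statistic. The function names and the stopping rule (a fixed number of intervals) are assumptions for illustration.

```python
def chi2(a, b):
    """Chi-square statistic for two adjacent intervals given their per-class counts."""
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        for j in range(len(a)):
            expected = sum(row) * (a[j] + b[j]) / total
            if expected > 0:  # skip classes absent from both intervals
                stat += (row[j] - expected) ** 2 / expected
    return stat

def chimerge(values, labels, n_classes, max_intervals):
    """Return the lower bound of each interval after ChiMerge-style merging."""
    # One initial interval per distinct value, holding per-class counts.
    intervals = []
    for v, y in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][y] += 1
        else:
            counts = [0] * n_classes
            counts[y] = 1
            intervals.append([v, counts])
    # Merge the most similar adjacent pair until few enough intervals remain.
    while len(intervals) > max_intervals:
        i = min(range(len(intervals) - 1),
                key=lambda k: chi2(intervals[k][1], intervals[k + 1][1]))
        merged = [x + y for x, y in zip(intervals[i][1], intervals[i + 1][1])]
        intervals[i:i + 2] = [[intervals[i][0], merged]]
    return [low for low, _ in intervals]

# Toy attribute: small values belong to class 0, large values to class 1.
print(chimerge([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1], n_classes=2, max_intervals=2))
```

The returned lower bounds can then be used as cut points to map the real-valued attribute to discrete bins before training a decision tree.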


2014 ◽  
Vol 26 (2) ◽  
pp. 567-582 ◽  
Author(s):  
Zhongxue Chen ◽  
Hon Keung Tony Ng ◽  
Jing Li ◽  
Qingzhong Liu ◽  
Hanwen Huang

In the past decade, hundreds of genome-wide association studies have been conducted to detect the significant single-nucleotide polymorphisms that are associated with certain diseases. However, most of the data from the X chromosome were not analyzed and only a few significantly associated single-nucleotide polymorphisms from the X chromosome have been identified from genome-wide association studies. This is mainly due to the lack of powerful statistical tests. In this paper, we propose a novel statistical approach that combines the information of single-nucleotide polymorphisms on the X chromosome from both males and females in an efficient way. The proposed approach avoids the need to make strong assumptions about the underlying genetic models. Our proposed statistical test is a robust method that only makes the assumption that the risk allele is the same for both females and males if the single-nucleotide polymorphism is associated with the disease for both genders. Through a simulation study and a real data application, we show that the proposed procedure is robust and has excellent performance compared to existing methods. We expect that many more associated single-nucleotide polymorphisms on the X chromosome will be identified if the proposed approach is applied to currently available genome-wide association study data.

