Machine learning-based detection of insertions and deletions in the human genome

2019 ◽  
Author(s):  
Charles Curnin ◽  
Rachel L. Goldfeder ◽  
Shruti Marwaha ◽  
Devon Bonner ◽  
Daryl Waggott ◽  
...  

Abstract Insertions and deletions (indels) make a critical contribution to human genetic variation. While indel calling has improved significantly, it lags dramatically in performance relative to single-nucleotide variant calling, something of particular concern for clinical genomics where larger scale disruption of the open reading frame can commonly cause disease. Here, we present a machine learning-based approach to the detection of indel breakpoints called Scotch. This novel approach improves sensitivity to larger variants dramatically by leveraging sequencing metrics and signatures of poor read alignment. We also introduce a meta-analytic indel caller, called Metal, that performs a “smart intersection” of Scotch and currently available tools to be maximally sensitive to large variants. We use new benchmark datasets and Sanger sequencing to compare Scotch and Metal to current gold standard indel callers, achieving unprecedented levels of precision and recall. We demonstrate the impact of these improvements by applying this tool to a cohort of patients with undiagnosed disease, generating plausible novel candidates in 21 out of 26 undiagnosed cases. We highlight the diagnosis of one patient with a 498-bp deletion in HNRNPA1 missed by traditional indel-detection tools.

2019 ◽  
Vol 20 (12) ◽  
pp. 2962 ◽  
Author(s):  
Kumaraswamy Naidu Chitrala ◽  
Mitzi Nagarkatti ◽  
Prakash Nagarkatti ◽  
Suneetha Yeguvapalli

Breast cancer is a leading cancer type and one of the major health issues faced by women around the world. Some of its major risk factors include body mass index, hormone replacement therapy, family history and germline mutations. Of these risk factors, estrogen levels play a crucial role. Among the estrogen receptors, estrogen receptor alpha (ERα) is known to interact directly with the tumor suppressor protein p53, thereby repressing its function. Previously, we studied the impact of the deleterious breast cancer-associated non-synonymous single nucleotide polymorphisms (nsSNPs) rs11540654 (R110P), rs17849781 (P278A) and rs28934874 (P151T) in the TP53 gene on the p53 DNA-binding core domain. In the present study, we aimed to analyze the impact of these mutations on the p53–ERα interaction. To this end, we modelled the full-length structure of human p53, validated its quality using PROCHECK and subjected it to energy minimization using the NOMAD-Ref web server. The three-dimensional structure of the ERα activation function-2 (AF-2) domain was downloaded from the Protein Data Bank. Interactions between the modelled native and mutant (R110P, P278A, P151T) p53 and ERα were studied using ZDOCK. Machine learning predictions on the interactions were performed using Weka software. Results from the protein–protein docking showed that the atoms, residues and solvent accessibility surface area (SASA) at the interface were increased in both p53 and ERα for the R110P mutation compared to the native complexes, indicating that R110P has more impact on the p53–ERα interaction than the other two mutants. Mutations P151T and P278A, on the other hand, showed a large deviation from the native p53–ERα complex in atoms and residues at the surface. Further, results from artificial neural network analysis showed that these structural features are important for predicting the impact of these three mutations on the p53–ERα interaction. 
Overall, these three mutations showed a large deviation in total SASA in both p53 and ERα. In conclusion, results from our study will be crucial in making the decisions for hormone-based therapies against breast cancer.


2019 ◽  
Vol 2019 (2) ◽  
pp. 47-65
Author(s):  
Balázs Pejó ◽  
Qiang Tang ◽  
Gergely Biczók

Abstract Machine learning algorithms have reached mainstream status and are widely deployed in many applications. The accuracy of such algorithms depends significantly on the size of the underlying training dataset; in reality a small or medium sized organization often does not have the necessary data to train a reasonably accurate model. For such organizations, a realistic solution is to train their machine learning models based on their joint dataset (which is a union of the individual ones). Unfortunately, privacy concerns prevent them from straightforwardly doing so. While a number of privacy-preserving solutions exist for collaborating organizations to securely aggregate the parameters in the process of training the models, we are not aware of any work that provides a rational framework for the participants to precisely balance the privacy loss and accuracy gain in their collaboration. In this paper, by focusing on a two-player setting, we model the collaborative training process as a two-player game where each player aims to achieve higher accuracy while preserving the privacy of its own dataset. We introduce the notion of Price of Privacy, a novel approach for measuring the impact of privacy protection on the accuracy in the proposed framework. Furthermore, we develop a game-theoretical model for different player types, and then either find or prove the existence of a Nash Equilibrium with regard to the strength of privacy protection for each player. Using recommendation systems as our main use case, we demonstrate how two players can make practical use of the proposed theoretical framework, including setting up the parameters and approximating the non-trivial Nash Equilibrium.


Author(s):  
Venkateswarlu Naik Midde ◽  
Vasumathi D ◽  
A.P. Siva Kumar

Introduction: Extracting distinguishing semantic-level emotions expressed in multiple languages on social media is an essential task in sentiment analysis and opinion mining. Extracting emotions expressed in Dravidian or local languages mixed with other languages on social media has become a key challenge in big data sentiment analysis. Methods: The proposed approach defines a framework, grounded in scientific linguistic theories, to recognize user sentiment in multilingual or Dravidian-language text. The method uses machine learning techniques such as naïve Bayes and support vector machine (SVM) for fine-grained classification of multilingual text with the help of lexicon-based feature groups. Results: In experiments on the collected benchmark datasets, the proposed approach outperformed corpus-based, word-level and phrase-level sentiment analysis for multilingual text. Conclusion: The machine learning technique SVM performed best for sentiment and emotion extraction.


2020 ◽  
Author(s):  
Abhijit Gupta ◽  
Mandar Kulkarni ◽  
Arnab Mukherjee

DNA carries the genetic code of life. Different conformations of DNA are associated with various biological functions. Predicting the conformation of DNA from its primary sequence, although desirable, is a challenging problem owing to the polymorphic nature of DNA. Although a few efforts were made in this regard, currently there exists no method that can accurately predict the conformation of right-handed DNA solely from the sequence. In this study, we present a novel approach based on machine learning that predicts A-DNA and B-DNA conformational propensities of a sequence with high accuracy (~95%). In addition, we show that the impact of the dinucleotide steps in determining the conformation agrees qualitatively with the free energy cost for A-DNA formation in water. This method enables us to examine the genomic sequence to understand the prospective biological roles played by the A-form of DNA.
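The abstract does not say how sequences are encoded for the classifier. As a minimal sketch only, assuming that normalized counts of the 16 dinucleotide steps are used as input features (an illustrative choice, not the authors' stated method), a sequence could be featurized as follows. The function name `dinucleotide_step_frequencies` is hypothetical.

```python
from collections import Counter
from itertools import product


def dinucleotide_step_frequencies(sequence):
    """Return the relative frequency of each of the 16 dinucleotide steps.

    Assumption: overlapping steps along one strand are counted, and steps
    containing characters other than A, C, G, T are ignored.
    """
    seq = sequence.upper()
    steps = ["".join(pair) for pair in product("ACGT", repeat=2)]
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[step] for step in steps)
    return {step: (counts[step] / total if total else 0.0) for step in steps}


# Example with a short, made-up sequence.
features = dinucleotide_step_frequencies("GGCCGGCC")
print(features["GG"], features["CG"])
```

A fixed-length vector like this could be passed to any standard classifier; whether it captures enough context to separate A-DNA from B-DNA propensity is exactly the question the study addresses.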


2020 ◽  
Author(s):  
Abhijit Gupta ◽  
Mandar Kulkarni ◽  
Arnab Mukherjee

DNA carries the genetic code of life. Different conformations of DNA are associated with various biological functions. Predicting the conformation of DNA from its primary sequence, although desirable, is a challenging problem owing to the polymorphic nature of DNA. Although a few efforts were made in this regard, currently there exists no method that can accurately predict the conformation of right-handed DNA solely from the sequence. In this study, we present a novel approach based on machine learning that predicts A-DNA and B-DNA conformational propensities of a sequence with high accuracy (~93%). In addition, we show that the impact of the dinucleotide steps in determining the conformation agrees qualitatively with the free energy cost for A-DNA formation in water. We are hopeful that our methodology can be employed on segments of the genomic sequence to understand the prospective biological roles played by the A-form of DNA.


Author(s):  
Dennis Collaris ◽  
Jarke J. van Wijk

Abstract The field of explainable artificial intelligence aims to help experts understand complex machine learning models. One key approach is to show the impact of a feature on the model prediction. This helps experts to verify and validate the predictions the model provides. However, many challenges remain open. For example, due to the subjective nature of interpretability, a strict definition of concepts such as the contribution of a feature remains elusive. Different techniques have varying underlying assumptions, which can cause inconsistent and conflicting views. In this work, we introduce local and global contribution-value plots as a novel approach to visualize feature impact on predictions and the relationship with feature value. We discuss design decisions and show an exemplary visual analytics implementation that provides new insights into the model. We conducted a user study and found the visualizations aid model interpretation by increasing correctness and confidence and reducing the time taken to obtain an insight.


2020 ◽  
Vol 6 (12) ◽  
Author(s):  
Stephen J. Bush

Read alignment is the central step of many analytic pipelines that perform variant calling. To reduce error, it is common practice to pre-process raw sequencing reads to remove low-quality bases and residual adapter contamination, a procedure collectively known as ‘trimming’. Trimming is widely assumed to increase the accuracy of variant calling, although there are relatively few systematic evaluations of its effects and no clear consensus on its efficacy. As sequencing datasets increase both in number and size, it is worthwhile reappraising computational operations of ambiguous benefit, particularly when the scope of many analyses now routinely incorporates thousands of samples, increasing the time and cost required. Using a curated set of 17 Gram-negative bacterial genomes, this study initially evaluated the impact of four read-trimming utilities (Atropos, fastp, Trim Galore and Trimmomatic), each used with a range of stringencies, on the accuracy and completeness of three bacterial SNP-calling pipelines. It was found that read trimming made only small, and statistically insignificant, increases in SNP-calling accuracy even when using the highest-performing pre-processor in this study, fastp. To extend these findings, >6500 publicly archived sequencing datasets from Escherichia coli, Mycobacterium tuberculosis and Staphylococcus aureus were re-analysed using a common analytic pipeline. Of the approximately 125 million SNPs and 1.25 million indels called across all samples, the same bases were called in 98.8 and 91.9 % of cases, respectively, irrespective of whether raw reads or trimmed reads were used. Nevertheless, the proportion of mixed calls (i.e. calls where <100 % of the reads support the variant allele; considered a proxy of false positives) was significantly reduced after trimming, which suggests that while trimming rarely alters the set of variant bases, it can affect the proportion of reads supporting each call. 
It was concluded that read quality- and adapter-trimming add relatively little value to a SNP-calling pipeline and may only be necessary if small differences in the absolute number of SNP calls, or the false call rate, are critical. Broadly similar conclusions can be drawn about the utility of trimming to an indel-calling pipeline. Read trimming remains routinely performed prior to variant calling likely out of concern that doing otherwise would typically have negative consequences. While historically this may have been the case, the data in this study suggests that read trimming is not always a practical necessity.
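The comparison of calls made from raw and trimmed reads can be illustrated with a small sketch. The abstract does not give the exact concordance definition, so the one below is an assumption: the fraction of all called sites at which both runs report the same alternate base. The function name `call_concordance` and the example call sets are hypothetical.

```python
def call_concordance(raw_calls, trimmed_calls):
    """Fraction of sites called identically in both call sets.

    Each argument maps a (contig, position) site to the called alternate base.
    Assumption: a site called in only one set counts as discordant.
    """
    all_sites = raw_calls.keys() | trimmed_calls.keys()
    if not all_sites:
        return 1.0
    shared_sites = raw_calls.keys() & trimmed_calls.keys()
    identical = sum(1 for site in shared_sites
                    if raw_calls[site] == trimmed_calls[site])
    return identical / len(all_sites)


# Made-up call sets: one identical call, one differing base,
# and one site unique to each set.
raw = {("chr", 100): "A", ("chr", 200): "T", ("chr", 300): "G"}
trimmed = {("chr", 100): "A", ("chr", 200): "C", ("chr", 400): "G"}
concordance = call_concordance(raw, trimmed)
print(concordance)
```

Counting sites unique to one set as discordant is the stricter of the plausible definitions; restricting the denominator to shared sites would give a higher figure.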


2019 ◽  
Author(s):  
Mattia Bosio ◽  
Alfonso Valencia ◽  
Salvador Capella-Gutierrez

Abstract Background: Transcriptomics data, often referred to as RNA-Seq, are increasingly being adopted in clinical practice because the same data can answer several questions (e.g. gene expression, splicing and allele-specific expression), even without matching DNA. Indeed, recent studies showed how RNA-Seq can help decipher the impact of germline variants. These efforts dramatically improved the diagnostic yield in specific rare disease patient cohorts. Nevertheless, RNA-Seq is not routinely adopted for germline variant calling in the clinic. This is mostly due to a combination of technical noise and biological processes that affect the reliability of results and are difficult to reduce using standard filtering strategies. Results: To provide reliable germline variant calling from RNA-Seq for clinical use, such as for Mendelian disease diagnosis, we developed SmartRNASeqCaller: a machine learning system focused on reducing the burden of false positive calls from RNA-Seq. Thanks to the availability of a large amount of high-quality data, we could comprehensively train SmartRNASeqCaller using a suitable feature set to characterize each potential variant. The model integrates information from multiple sources, capturing variant-specific characteristics, contextual information, and external sources of annotation. We tested our tool against state-of-the-art workflows on a set of 376 independent validation samples from the GIAB, Neuromics, and GTEx consortia. SmartRNASeqCaller remarkably increases the precision of RNA-Seq germline variant calls, reducing the false positive burden by 50% without strong impact on sensitivity. This translates to an average precision increase of 20.9%, showing a consistent effect on samples from different origins and characteristics. Conclusions: SmartRNASeqCaller shows that a general strategy adopted in different areas of applied machine learning can be exploited to improve variant calling. 
Switching from a naïve hard-filtering schema to a more powerful, data-driven solution enabled a qualitative and quantitative improvement in terms of precision/recall performances. This is key for the intended use of SmartRNASeqCaller within clinical settings to identify disease-causing variants.
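The link between removing false positives and gaining precision can be made concrete with a short worked example. The counts below are hypothetical and are not taken from the study; they only show how precision, defined as TP / (TP + FP), changes when false positives are halved and true positives are left untouched.

```python
def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)


def precision_after_filtering(true_positives, false_positives,
                              fp_reduction, sensitivity_loss=0.0):
    """Precision after removing a fraction of false positives.

    fp_reduction and sensitivity_loss are fractions in [0, 1]; the latter
    models true positives lost to the filter (assumed to be 0 by default).
    """
    kept_tp = true_positives * (1.0 - sensitivity_loss)
    kept_fp = false_positives * (1.0 - fp_reduction)
    return precision(kept_tp, kept_fp)


# Hypothetical call set: 700 true and 300 false positive calls.
before = precision(700, 300)
after = precision_after_filtering(700, 300, fp_reduction=0.5)
print(round(before, 3), round(after, 3))
```

The size of the gain depends on the starting ratio of false to true positives, so a fixed 50% reduction in false positives does not correspond to a single fixed increase in precision.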

