SmartRNASeqCaller: improving germline variant calling from RNAseq

2019 ◽  
Author(s):  
Mattia Bosio ◽  
Alfonso Valencia ◽  
Salvador Capella-Gutierrez

Abstract Background Transcriptomics data, often referred to as RNA-Seq, are increasingly being adopted in clinical practice because the same data can answer several questions - e.g. gene expression, splicing, and allele-specific expression, even without matching DNA. Indeed, recent studies showed how RNA-Seq can contribute to deciphering the impact of germline variants, and these efforts dramatically improved the diagnostic yield in specific rare-disease patient cohorts. Nevertheless, RNA-Seq is not routinely adopted for germline variant calling in the clinic. This is mostly due to a combination of technical noise and biological processes that affect the reliability of results and are difficult to reduce using standard filtering strategies. Results To provide reliable germline variant calling from RNA-Seq for clinical use, such as Mendelian disease diagnosis, we developed SmartRNASeqCaller: a machine learning system designed to reduce the burden of false positive calls from RNA-Seq. Thanks to the availability of a large amount of high-quality data, we could comprehensively train SmartRNASeqCaller using a suitable feature set to characterize each potential variant. The model integrates information from multiple sources, capturing variant-specific characteristics, contextual information, and external sources of annotation. We tested our tool against state-of-the-art workflows on a set of 376 independent validation samples from the GIAB, Neuromics, and GTEx consortia. SmartRNASeqCaller remarkably increases the precision of RNA-Seq germline variant calls, reducing the false positive burden by 50% without a strong impact on sensitivity. This translates to an average precision increase of 20.9%, with a consistent effect on samples of different origins and characteristics. Conclusions SmartRNASeqCaller shows that a general strategy adopted in different areas of applied machine learning can be exploited to improve variant calling. Switching from a naïve hard-filtering schema to a more powerful, data-driven solution enabled a qualitative and quantitative improvement in precision/recall performance. This is key for the intended use of SmartRNASeqCaller in clinical settings to identify disease-causing variants.
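The headline numbers are easy to sanity-check: halving the false-positive burden at roughly constant sensitivity raises precision mechanically. A minimal sketch with hypothetical call counts (the 8,000/4,000 split below is made up for illustration, not taken from the paper):

```python
def precision(tp: int, fp: float) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)

# Hypothetical RNA-Seq call set: 8,000 true positives, 4,000 false positives.
before = precision(8000, 4000)        # 2/3
after = precision(8000, 4000 * 0.5)   # FP burden halved
gain = after - before                 # roughly 13 percentage points here
```

The exact gain depends on the starting false-positive rate of each sample, which is why the paper reports an average precision increase rather than a single fixed figure.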

2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Qingsong Xi ◽  
Qiyu Yang ◽  
Meng Wang ◽  
Bo Huang ◽  
Bo Zhang ◽  
...  

Abstract Background Significant efforts have been made to minimize the rate of in vitro fertilization (IVF)-associated multiple-embryo gestation. Previous studies of machine learning in IVF mainly focused on selecting top-quality embryos to improve outcomes; however, in patients with a sub-optimal prognosis or with medium- or inferior-quality embryos, the choice between single embryo transfer (SET) and double embryo transfer (DET) can be perplexing. Methods This was an application study including 9211 patients with 10,076 embryos treated from 2016 to 2018 at Tongji Hospital, Wuhan, China. A hierarchical model was established using the machine learning system XGBoost to learn embryo implantation potential and the impact of DET simultaneously. The performance of the model was evaluated with the AUC of the ROC curve. Multiple regression analyses were also conducted on the 19 selected features to demonstrate the differences between feature importance for prediction and statistical relationship with outcomes. Results For SET pregnancy, the following variables remained significant: age, attempts at IVF, estradiol level on hCG day, and endometrial thickness. For DET pregnancy, age, attempts at IVF, endometrial thickness, and the newly added P1 + P2 remained significant. For DET twin risk, age, attempts at IVF, 2PN/MII, and P1 × P2 remained significant. The algorithm was repeated 30 times, and average AUCs of 0.7945, 0.8385, and 0.7229 were achieved for SET pregnancy, DET pregnancy, and DET twin risk, respectively. The trends of predicted and observed rates for both pregnancy and twin risk were essentially identical. XGBoost outperformed the other two algorithms: logistic regression and classification and regression tree.
Conclusion Artificial intelligence based on determinant-weighting analysis could offer an individualized embryo selection strategy for any given patient and predict clinical pregnancy rate and twin risk, thereby optimizing clinical outcomes.
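The hierarchical idea — score each embryo's implantation potential, then reason about the combined outcome of transferring two — can be sketched with a toy decision rule. The probabilities, penalty weight, and threshold below are invented for illustration; the actual system is a trained XGBoost model, not this closed-form rule:

```python
def recommend_transfer(p1: float, p2: float,
                       twin_penalty: float = 0.5,
                       set_threshold: float = 0.4) -> str:
    """Toy SET-vs-DET rule. p1, p2 are predicted implantation probabilities
    of the two candidate embryos. DET pregnancy chance is approximated as
    1 - (1-p1)(1-p2), and twin risk as p1*p2 (echoing the P1 + P2 and
    P1 x P2 features the study found significant)."""
    det_pregnancy = 1 - (1 - p1) * (1 - p2)
    twin_risk = p1 * p2
    set_score = max(p1, p2)
    det_score = det_pregnancy - twin_penalty * twin_risk
    # Prefer SET when the best single embryo is strong enough on its own.
    return "SET" if set_score >= set_threshold or set_score >= det_score else "DET"
```

With two promising embryos the rule avoids the twin risk of DET; with two weak embryos it accepts DET to raise the pregnancy chance.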


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Gavin W. Wilson ◽  
Mathieu Derouet ◽  
Gail E. Darling ◽  
Jonathan C. Yeung

Abstract Identifying single nucleotide variants has become common practice for droplet-based single-cell RNA-seq experiments; however, presently, a pipeline does not exist to maximize variant calling accuracy. Furthermore, molecular duplicates generated in these experiments have not been utilized to optimally detect variant co-expression. Herein, we introduce scSNV designed from the ground up to “collapse” molecular duplicates and accurately identify variants and their co-expression. We demonstrate that scSNV is fast, with a reduced false-positive variant call rate, and enables the co-detection of genetic variants and A>G RNA edits across twenty-two samples.
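The core of duplicate collapsing can be illustrated with a stripped-down sketch: reads sharing a cell barcode, UMI, and position originate from one molecule, so their base calls are merged by majority vote. This is a simplification of what scSNV does (the real tool works on aligned reads and handles ties and base qualities); the read representation below is made up:

```python
from collections import Counter, defaultdict

def collapse_duplicates(reads):
    """reads: iterable of (cell_barcode, umi, position, base) tuples.
    Groups molecular duplicates and returns one consensus base per molecule."""
    groups = defaultdict(list)
    for cell_barcode, umi, position, base in reads:
        groups[(cell_barcode, umi, position)].append(base)
    # Majority vote within each duplicate group.
    return {key: Counter(bases).most_common(1)[0][0]
            for key, bases in groups.items()}
```

Collapsing before variant calling both removes amplification noise and lets co-expression of variants be counted per molecule rather than per read.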


Author(s):  
Bartosz Firlik ◽  
Maciej Tabaszewski

This paper presents the concept of a simple system for identifying the technical condition of tracks, based on a trained learning system in the form of three independent neural networks. The studies conducted showed that basic measurements based on the root mean square of vibration acceleration allow the track condition to be monitored, provided that the rail type is included in the information system. It is also necessary to select data based on a threshold value of the vehicle velocity. In higher velocity ranges (above 40 km/h), it is possible to distinguish technical conditions with a permissible error of 5%. Such selection also makes it possible to ignore the impact of rides through switches and crossings. Technical condition monitoring is also possible at lower ride velocities; however, this comes at the cost of reduced accuracy of the analysis.
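The feature extraction described above — RMS of vibration acceleration, gated on vehicle velocity — is straightforward to sketch. The 40 km/h threshold comes from the paper; the window and velocity values in the comments are invented:

```python
import math

def rms(window):
    """Root mean square of one window of acceleration samples."""
    return math.sqrt(sum(x * x for x in window) / len(window))

def usable_features(windows, velocities_kmh, threshold_kmh=40.0):
    """Keep RMS features only for windows recorded above the velocity
    threshold, as the data-selection step above requires."""
    return [rms(w) for w, v in zip(windows, velocities_kmh)
            if v > threshold_kmh]
```

The retained RMS values would then be fed, together with the rail type, to the neural networks that classify track condition.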


2021 ◽  
Author(s):  
Qingsong XI ◽  
Qiyu YANG ◽  
Meng WANG ◽  
Bo HUANG ◽  
Bo ZHANG ◽  
...  

Abstract Background: Significant efforts have been made to minimize the rate of in vitro fertilization (IVF)-associated multiple-embryo gestation. Previous studies of machine learning in IVF mainly focused on selecting top-quality embryos to improve outcomes; however, in patients with a sub-optimal prognosis or with medium- or inferior-quality embryos, the choice between single embryo transfer (SET) and double embryo transfer (DET) can be perplexing. Methods: This was an application study including 7887 patients with 8585 embryos treated from 2016 to 2018 at Tongji Hospital, Wuhan, China. A hierarchical model was established using the machine learning system XGBoost to learn embryo implantation potential and the impact of DET simultaneously. The performance of the model was evaluated with the AUC of the ROC curve. Multiple regression analyses were also conducted on the 19 selected features to demonstrate the differences between feature importance for prediction and statistical relationship with outcomes. Results: For SET pregnancy, the following variables remained significant: age, attempts at IVF, estradiol level on hCG day, and endometrial thickness. For DET pregnancy, age, attempts at IVF, endometrial thickness, and the newly added P1+P2 remained significant. For DET twin risk, age, attempts at IVF, 2PN/MII, and P1×P2 remained significant. The algorithm was repeated 30 times, and average AUCs of 0.7945, 0.8385, and 0.7229 were achieved for SET pregnancy, DET pregnancy, and DET twin risk, respectively. The trends of predicted and observed rates for both pregnancy and twin risk were essentially identical. XGBoost outperformed the other two algorithms: logistic regression and classification and regression tree.
Conclusion: Artificial intelligence based on determinant-weighting analysis could offer an individualized embryo selection strategy for any given patient and predict clinical pregnancy rate and twin risk, thereby optimizing clinical outcomes.


PLoS ONE ◽  
2021 ◽  
Vol 16 (10) ◽  
pp. e0258550
Author(s):  
Upendra Kumar Pradhan ◽  
Nitesh Kumar Sharma ◽  
Prakash Kumar ◽  
Ashwani Kumar ◽  
Sagar Gupta ◽  
...  

The formation of mature miRNAs and their expression is a highly controlled process that depends strongly on post-transcriptional regulatory events. Recent findings suggest that several RNA-binding proteins (RBPs) beyond Drosha/Dicer are involved in the processing of miRNAs. Deciphering conditional networks for these RBP-miRNA interactions may help explain the spatio-temporal nature of miRNAs and can also be used to predict miRNA profiles. In this direction, >25 TB of data from different platforms (CLIP-seq/RNA-seq/miRNA-seq) were studied to develop Bayesian causal networks capable of reasoning about miRNA biogenesis. The networks ably explained miRNA formation when tested across a large number of conditions and experimentally validated data. The networks were modeled into an XGBoost machine learning system in which the expression information of the network components was found capable of quantitatively explaining miRNA formation levels and profiles. The models were developed for 1,204 human miRNAs whose expression levels could be accurately detected from RNA-seq data alone, without any need for separate miRNA profiling experiments such as miRNA-seq or arrays. The first of its kind, miRbiom performed consistently well, with a high average accuracy (91%), when tested across a large number of experimentally established datasets from several conditions. It has been implemented as an interactive open-access web server where, besides deriving miRNA profiles, downstream functional analysis can also be done. miRbiom will help obtain accurate predictions of human miRNA profiles in the absence of profiling experiments and will be an asset for regulatory research. The study also shows the importance of RBP interaction information in better understanding miRNAs and their functional profiles, and it lays the foundation for such studies and software in the future.
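At prediction time, such a system boils down to mapping the expression of a miRNA's network parents (RBPs, host gene, processing factors) to an expected miRNA level. The real models are per-miRNA XGBoost ensembles over Bayesian-network-derived features; the linear stand-in and the weights below are purely illustrative:

```python
def predict_mirna_level(parent_expression, weights, bias=0.0):
    """Toy stand-in for a per-miRNA model: combine the expression of the
    miRNA's network parents into one predicted expression level.
    All weights here are hypothetical, not learned values."""
    return bias + sum(w * x for w, x in zip(weights, parent_expression))

# Hypothetical miRNA with three network parents (e.g. two RBPs and a host gene).
level = predict_mirna_level([2.0, 0.5, 1.0], [0.8, -0.4, 0.6], bias=0.1)
```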


2021 ◽  
pp. annrheumdis-2020-218359
Author(s):  
Xinyi Meng ◽  
Xiaoyuan Hou ◽  
Ping Wang ◽  
Joseph T Glessner ◽  
Hui-Qi Qu ◽  
...  

Objective Juvenile idiopathic arthritis (JIA) is the most common type of arthritis among children, but few studies have investigated the contribution of rare variants to JIA. In this study, we aimed to identify rare coding variants associated with JIA across the genome-wide landscape. Methods We established a rare variant calling and filtering pipeline and performed rare coding variant and gene-based association analyses on three RNA-seq datasets comprising 228 JIA patients from the Gene Expression Omnibus against different sets of controls, and further conducted replication in our whole-exome sequencing (WES) data of 56 JIA patients. We then conducted differential gene expression analysis and assessed the impact of recurrent functional coding variants on gene expression and signalling pathways. Results From the RNA-seq data, we identified variants in two genes reported in the literature as JIA causal variants, as well as an additional 63 recurrent rare coding variants seen only in JIA patients. Among the 44 recurrent rare variants found in polyarticular patients, 10 were replicated by our WES of patients with the same JIA subtype. Several genes with recurrent functional rare coding variants also have common variants associated with autoimmune diseases. We observed immune pathways enriched for the genes with rare coding variants and differentially expressed genes. Conclusion This study elucidated a novel landscape of recurrent rare coding variants in JIA patients and uncovered significant associations with JIA at the gene pathway level. The convergence of common and rare variants for autoimmune diseases is also highlighted in this study.
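The filtering logic at the heart of such a pipeline — keep rare coding variants that recur among patients but never appear in controls — can be sketched as a set operation. The variant representation and recurrence threshold below are assumptions for illustration, not the authors' exact criteria:

```python
from collections import Counter

def recurrent_patient_only(patient_calls, control_calls, min_recurrence=2):
    """patient_calls / control_calls: list of per-sample variant sets
    (e.g. (chrom, pos, ref, alt) tuples). Returns variants seen in at
    least `min_recurrence` patients and in no control sample."""
    control_variants = set().union(*control_calls) if control_calls else set()
    counts = Counter(v for calls in patient_calls for v in set(calls))
    return {v for v, n in counts.items()
            if n >= min_recurrence and v not in control_variants}
```

The surviving variants would then feed the gene-based association and pathway-enrichment steps described in the abstract.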


Author(s):  
Marina Azer ◽  
◽  
Mohamed Taha ◽  
Hala H. Zayed ◽  
Mahmoud Gadallah

Social media presence is a crucial part of our lives, and social media is considered a more important source of information than traditional sources. Twitter has become one of the prevalent social sites for exchanging viewpoints and feelings. This work proposes a supervised machine learning system for detecting false news. One of the challenges in credibility detection is finding new features that are most predictive, to build better-performing classifiers. Both features depending on news content and features based on the user are used. The importance of the features and their impact on performance are examined, and the reasons for choosing the final feature set using the k-best method are explained. Seven supervised machine learning classifiers are used: Naïve Bayes (NB), support vector machine (SVM), k-nearest neighbors (KNN), logistic regression (LR), random forest (RF), maximum entropy (ME), and conditional random forest (CRF). Training and testing were conducted using the Pheme dataset. An analysis of the features is introduced and compared with the content-based features as the decisive factors in determining validity. Random forest shows the highest performance when using user-based features only (accuracy 82.2%) and when using a mixture of both feature types; the highest overall results were achieved using both types of features with the random forest classifier (accuracy 83.4%). In contrast, logistic regression was the best when using content-based features only. Performance is measured with several metrics: accuracy, precision, recall, and F1-score. We compared our feature set with those of other studies and assessed the impact of our new features.
We found that our approach yields a substantial improvement in discovering and verifying false news compared with current results.
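The k-best selection step mentioned above can be sketched with a simple univariate score. Scikit-learn's SelectKBest is the usual tool; here a dependency-free stand-in ranks features by the absolute difference of class means (the tiny dataset is made up):

```python
def select_k_best(rows, labels, k):
    """Rank features by |mean(class 1) - mean(class 0)| and return the
    indices of the top k (a crude stand-in for SelectKBest scoring)."""
    def score(j):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(range(len(rows[0])), key=score, reverse=True)[:k]
```

The selected column indices would then restrict the feature matrix passed to each of the seven classifiers, so that all models compete on the same reduced feature set.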


2021 ◽  
Vol 42 (Supplement_1) ◽  
Author(s):  
A Rosier ◽  
E Crespin ◽  
A Lazarus ◽  
G Laurent ◽  
A Menet ◽  
...  

Abstract Background Implantable Loop Recorders (ILRs) are increasingly used and generate a high workload for timely adjudication of ECG recordings. In particular, the excessive false positive rate leads to a significant review burden. Purpose A novel machine learning algorithm was developed to reclassify ILR episodes in order to decrease the false positive rate by 80% while maintaining 99% sensitivity. This study aims to evaluate the impact of this algorithm in reducing the number of abnormal episodes reported by Medtronic ILRs. Methods Across 20 European centers, all Medtronic ILR patients were enrolled during the 2nd semester of 2020. Using a remote monitoring platform, every ILR-transmitted episode was collected and anonymised. For every ILR-detected episode with a transmitted ECG, the new algorithm reclassified it using the same labels as the ILR (asystole, brady, AT/AF, VT, artifact, normal). We measured the number of episodes identified as false positive and reclassified as normal by the algorithm, and their proportion among all episodes. Results In 370 patients, ILRs recorded 3755 episodes, including 305 patient-triggered and 629 with no ECG transmitted. 2821 episodes were analyzed by the novel algorithm, which reclassified 1227 episodes as normal rhythm. These reclassified episodes accounted for 43% of analyzed episodes and 32.6% of all recorded episodes. Conclusion A novel machine learning algorithm significantly reduces the quantity of episodes flagged as abnormal and typically reviewed by healthcare professionals. Funding Acknowledgement Type of funding sources: None. Figure 1. ILR episodes analysis
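The reported trade-off — cut false positives sharply while holding sensitivity at 99% — is typically achieved by sweeping a classifier's score cutoff. A minimal sketch with made-up episode scores (the study's actual algorithm and features are not described here):

```python
def cutoff_for_sensitivity(true_scores, false_scores, target=0.99):
    """Return the highest score cutoff that still flags at least `target`
    of true abnormal episodes, plus the fraction of false-positive
    episodes still flagged at that cutoff."""
    best = None
    for c in sorted(set(true_scores) | set(false_scores), reverse=True):
        sensitivity = sum(s >= c for s in true_scores) / len(true_scores)
        if sensitivity >= target:
            best = c  # highest cutoff meeting the target removes the most FPs
            break
    fp_rate = sum(s >= best for s in false_scores) / len(false_scores)
    return best, fp_rate
```

Episodes scoring below the chosen cutoff are the ones reclassified as normal and removed from the review queue.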


2019 ◽  
Author(s):  
Charles Curnin ◽  
Rachel L. Goldfeder ◽  
Shruti Marwaha ◽  
Devon Bonner ◽  
Daryl Waggott ◽  
...  

Abstract Insertions and deletions (indels) make a critical contribution to human genetic variation. While indel calling has improved significantly, it lags dramatically in performance relative to single-nucleotide variant calling, something of particular concern for clinical genomics where larger scale disruption of the open reading frame can commonly cause disease. Here, we present a machine learning-based approach to the detection of indel breakpoints called Scotch. This novel approach improves sensitivity to larger variants dramatically by leveraging sequencing metrics and signatures of poor read alignment. We also introduce a meta-analytic indel caller, called Metal, that performs a “smart intersection” of Scotch and currently available tools to be maximally sensitive to large variants. We use new benchmark datasets and Sanger sequencing to compare Scotch and Metal to current gold standard indel callers, achieving unprecedented levels of precision and recall. We demonstrate the impact of these improvements by applying this tool to a cohort of patients with undiagnosed disease, generating plausible novel candidates in 21 out of 26 undiagnosed cases. We highlight the diagnosis of one patient with a 498-bp deletion in HNRNPA1 missed by traditional indel-detection tools.
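The "smart intersection" idea — strict agreement for small indels, where callers are already precise, but single-caller evidence for large ones, where sensitivity matters most — can be sketched as set logic. The 20 bp size cutoff and the tuple representation are assumptions for illustration, not Metal's actual parameters:

```python
def smart_intersection(calls_a, calls_b, large_bp=20):
    """calls_*: sets of (chrom, pos, indel_length) tuples from two callers.
    Small indels must be called by both; large indels pass from either."""
    agreed = calls_a & calls_b
    large = {c for c in calls_a | calls_b if c[2] >= large_bp}
    return agreed | large
```

This keeps the precision of a plain intersection for common small indels while preserving large-variant calls that only one tool is sensitive enough to detect.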


2021 ◽  
Author(s):  
Beth Signal ◽  
Tim Kahlke

ABSTRACT ORF prediction in de-novo assembled transcriptomes is a critical step for RNA-Seq analysis and transcriptome annotation. However, current approaches do not appropriately account for factors such as strand-specificity and incompletely assembled transcripts. Strand-specific RNA-Seq libraries should produce assembled transcripts in the correct orientation, and therefore ORFs should only be annotated on the sense strand. Additionally, start site selection is more complex than appreciated, as sequences upstream of the first start codon need to be correctly annotated as 5’ UTR in completely assembled transcripts, or as part of the main ORF in incomplete transcripts. Both of these factors influence the accurate annotation of ORFs and therefore the transcriptome as a whole. We generated four de-novo transcriptome assemblies of well annotated species as a gold-standard dataset to test the impact strand specificity and start site selection have on ORF prediction in real data. Our results show that prediction of ORFs on the antisense strand in data from stranded RNA libraries results in false-positive ORFs with no or very low similarity to known proteins. In addition, we found that up to 23% of assembled transcripts had no stop codon upstream and in-frame of the first start codon, instead comprising a sequence of upstream codons. We found the optimal length cutoff of these upstream sequences to accurately classify these transcripts as either complete (upstream sequence is 5’ UTR) or 5’ incomplete (transcript is incompletely assembled and upstream sequence is part of the ORF). Here, we present Borf, the better ORF finder, specifically designed to minimise false-positive ORF prediction in stranded RNA-Seq data and improve annotation of ORF start-site prediction accuracy. Borf is written in Python3 and freely available at https://github.com/betsig/borf.
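The start-site decision described above — is the sequence upstream of the first ATG genuine 5’ UTR, or the tail of an incompletely assembled ORF? — hinges on whether an in-frame stop codon precedes that ATG. A simplified sketch of that check (Borf additionally applies the upstream-length cutoff discussed in the abstract, which is omitted here):

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def classify_upstream(transcript):
    """Classify the 5' end of an assembled transcript relative to its
    first ATG. An in-frame upstream stop codon means the upstream
    sequence is genuine 5' UTR ("complete"); otherwise the ORF may
    extend beyond the assembled 5' end ("5prime_incomplete")."""
    start = transcript.find("ATG")
    if start < 0:
        return "no_start"
    # Codons upstream of the start, in the same reading frame as the ORF.
    upstream_codons = {transcript[i:i + 3]
                       for i in range(start % 3, start - 2, 3)}
    return "complete" if upstream_codons & STOP_CODONS else "5prime_incomplete"
```

On stranded data this check is applied to the sense strand only, which is exactly the strand-specificity point the abstract makes.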

