Machine Learning Approaches for the Prioritization of Genomic Variants Impacting Pre-mRNA Splicing

Cells ◽  
2019 ◽  
Vol 8 (12) ◽  
pp. 1513 ◽  
Author(s):  
Charlie F Rowlands ◽  
Diana Baralle ◽  
Jamie M Ellingford

Defects in pre-mRNA splicing are frequently a cause of Mendelian disease. Despite the advent of next-generation sequencing, allowing a deeper insight into a patient’s variant landscape, the ability to characterize variants causing splicing defects has not progressed with the same speed. To address this, recent years have seen a sharp spike in the number of splice prediction tools leveraging machine learning approaches, leaving clinical geneticists with a plethora of choices for in silico analysis. In this review, some basic principles of machine learning are introduced in the context of genomics and splicing analysis. A critical comparative approach is then used to describe seven recent machine learning-based splice prediction tools, revealing highly diverse approaches and common caveats. We find that, although great progress has been made in producing specific and sensitive tools, there is still much scope for personalized approaches to prediction of variant impact on splicing. Such approaches may increase diagnostic yields and underpin improvements to patient care.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Adonis D’Mello ◽  
Christian P. Ahearn ◽  
Timothy F. Murphy ◽  
Hervé Tettelin

Abstract. Background: Reverse vaccinology accelerates the discovery of potential vaccine candidates (PVCs) prior to experimental validation. Current programs typically use one bacterial proteome to identify PVCs through a filtering architecture using feature prediction programs or a machine learning approach. Filtering approaches may eliminate potential antigens based on limitations in the accuracy of the prediction tools used. Machine learning approaches are heavily dependent on the selection of training datasets with experimentally validated antigens (positive control) and non-protective antigens (negative control). The use of one or few bacterial proteomes does not assess PVC conservation among strains, an important feature of vaccine antigens. Results: We present ReVac, which implements both a panoply of feature prediction programs without filtering out proteins, and scoring of candidates based on predictions made on curated positive and negative control PVC datasets. ReVac surveys several genomes, assessing protein conservation as well as DNA and protein repeats, which may result in variable expression of PVCs. ReVac’s orthologous clustering of conserved genes identifies core and dispensable genome components. This is useful for determining the degree of conservation of PVCs among the population of isolates for a given pathogen. Potential vaccine candidates are then prioritized based on conservation and overall feature-based scoring. We applied ReVac to 69 Moraxella catarrhalis and 270 non-typeable Haemophilus influenzae genomes, prioritizing 64 and 29 proteins as PVCs, respectively. Conclusion: ReVac’s use of a scoring scheme ranks PVCs for subsequent experimental testing. It employs a redundancy-based approach in its predictions of features using several prediction tools. Each protein’s features are collated, and each protein is ranked based on the scoring scheme. Multi-genome analyses performed in ReVac allow for a comprehensive overview of PVCs from a pan-genome perspective, as an essential prerequisite for any bacterial subunit vaccine design. ReVac prioritized PVCs of two human respiratory pathogens, identifying both novel and previously validated PVCs.
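
The prioritization step described in this abstract, ranking candidates on conservation together with a feature-based score, can be illustrated with a minimal sketch. It is not ReVac's actual scoring scheme, which the abstract does not specify: the field names, the additive formula, the weight and the protein identifiers are all hypothetical, and conservation is assumed here to be the fraction of surveyed genomes that carry the protein.

```python
def rank_candidates(proteins, conservation_weight=1.0):
    """Sort proteins by feature score plus weighted conservation, highest first.

    The additive formula and the weight are illustrative assumptions only.
    """
    def total_score(protein):
        return protein["feature_score"] + conservation_weight * protein["conservation"]

    return sorted(proteins, key=total_score, reverse=True)


# Hypothetical candidates: conservation is assumed to be the fraction of
# genomes in which the protein is present.
proteins = [
    {"id": "protein_1", "feature_score": 3.0, "conservation": 0.50},
    {"id": "protein_2", "feature_score": 2.5, "conservation": 1.00},
    {"id": "protein_3", "feature_score": 1.0, "conservation": 0.25},
]

ranked_ids = [p["id"] for p in rank_candidates(proteins, conservation_weight=2.0)]
print(ranked_ids)
```

With a conservation weight of 2.0, the fully conserved protein_2 overtakes protein_1 despite its lower feature score, which is the kind of trade-off a pan-genome ranking has to make explicit.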


Author(s):  
Ruihan Zhang ◽  
Xiaoli Li ◽  
Xingjie Zhang ◽  
Huayan Qin ◽  
Weilie Xiao

This review presents the basic principles, protocols and examples of using machine learning approaches to investigate the bioactivity of natural products.


2021 ◽  
Vol 38 ◽  
pp. 00142
Author(s):  
Moisey Zakharov ◽  
Mikhail Cherosov ◽  
Elena Troeva ◽  
Sebastien Gadal

For the first time, geoinformation modelling and machine learning approaches have been used to study the vegetation cover of the mountainous part of North-Eastern Siberia, the Orulgan medium-altitude mountain landscape province. These technologies allowed us to distinguish a number of mapping units that were used for the creation and analysis of a 1:100 000 scale vegetation map of the interpreted key area. Based on these studies, we decided upon the basic principles, approaches and technologies that will serve as a methodological basis for further studies of the vegetation cover of the larger region. Relief, slope aspect, genetic types of sediments, and moisture conditions were selected as supplementary factors to the vegetation indices for differentiation of both plant communities and vegetation map units.


1976 ◽  
Vol 15 (02) ◽  
pp. 69-74
Author(s):  
M. Goldberg ◽  
B. Doyon

This paper describes a general data base management package, devoted to medical applications. SARI is a user-oriented system, able to take into account applications very different by their nature, structure, size, operating procedures and general objectives, without any specific programming. It can be used in conversational mode by users with no previous knowledge of computers, such as physicians or medical clerks. As medical data are often personal data, the privacy problem is emphasized and a satisfactory solution implemented in SARI. The basic principles of the data base and program organization are described; specific efforts have been made in order to increase compactness and to make maintenance easy. Several medical applications are now operational with SARI. The next steps will mainly consist in the implementation of highly sophisticated functions.


2019 ◽  
Vol 70 (3) ◽  
pp. 214-224
Author(s):  
Bui Ngoc Dung ◽  
Manh Dzung Lai ◽  
Tran Vu Hieu ◽  
Nguyen Binh T. H.

Video surveillance is an emerging research field within intelligent transport systems. This paper presents some techniques that use machine learning and computer vision for vehicle detection and tracking. First, machine learning approaches using Haar-like features and the AdaBoost algorithm for vehicle detection are presented. Second, approaches are given for detecting vehicles using a background subtraction method based on a Gaussian Mixture Model and for tracking vehicles using optical flow and multiple Kalman filters. The method has the advantage of distinguishing and tracking multiple vehicles individually. The experimental results demonstrate the high accuracy of the method.
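
As a rough illustration of the Kalman filtering mentioned in this abstract, the sketch below smooths a noisy one-dimensional position track. It is a deliberately simplified scalar filter with a random-walk state model, not the authors' tracker: the abstract does not give their state model, so the function name, noise variances and sample measurements are assumptions. A multi-vehicle tracker would typically run one such filter per detected vehicle, usually with a richer state such as position and velocity.

```python
def kalman_1d(measurements, process_var=1e-2, measurement_var=1.0,
              initial_estimate=0.0, initial_var=1.0):
    """Scalar Kalman filter with a random-walk state model (illustrative only)."""
    estimate, variance = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        variance += process_var
        # Update: blend prediction and measurement according to the Kalman gain.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        estimates.append(estimate)
    return estimates


# Hypothetical noisy readings of a vehicle coordinate near 10.0.
readings = [9.6, 10.3, 10.1, 9.8, 10.4, 9.9, 10.2, 10.0]
smoothed = kalman_1d(readings, initial_estimate=readings[0])
print(smoothed[-1])
```

The gain falls as the filter grows more confident, so later measurements move the estimate less than early ones; raising process_var makes the filter follow sudden changes more quickly at the cost of more noise.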


2017 ◽  
Author(s):  
Sabrina Jaeger ◽  
Simone Fulle ◽  
Samo Turk

Inspired by natural language processing techniques, we here introduce Mol2vec, which is an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to the Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up the vectors of the individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as a reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment independent and can thus also be easily used for proteins with low sequence similarities.
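
The encoding step described in this abstract, summing substructure vectors to obtain a compound vector, can be sketched in a few lines. This is a toy illustration rather than the Mol2vec implementation: the substructure identifiers and the three-dimensional vectors are made up, whereas a trained model would supply learned embeddings of much higher dimension.

```python
from math import sqrt

# Hypothetical embeddings keyed by made-up substructure identifiers.
SUBSTRUCTURE_VECTORS = {
    "sub_a": [1.0, 0.0, 2.0],
    "sub_b": [0.5, 1.0, 0.0],
    "sub_c": [0.0, 2.0, 1.0],
}


def compound_vector(substructures, vectors=SUBSTRUCTURE_VECTORS):
    """Encode a compound as the element-wise sum of its substructure vectors."""
    dimension = len(next(iter(vectors.values())))
    total = [0.0] * dimension
    for name in substructures:
        for i, value in enumerate(vectors[name]):
            total[i] += value
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


compound_1 = compound_vector(["sub_a", "sub_b"])
compound_2 = compound_vector(["sub_a", "sub_b", "sub_c"])
print(compound_1, compound_2, cosine_similarity(compound_1, compound_2))
```

The resulting fixed-length vectors could then be passed as features to any supervised model, which is the use the abstract describes.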

