Supervised biomedical semantic similarity

Influence of the go-based semantic similarity measures in multi-objective gene clustering algorithm performance

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720020500389 ◽

2020 ◽

Vol 18 (06) ◽

pp. 2050038

Author(s):

Jorge Parraga-Alava ◽

Mario Inostroza-Ponta

Keyword(s):

Semantic Similarity ◽

Clustering Algorithm ◽

Performance Metrics ◽

Expression Patterns ◽

Biological Significance ◽

Similarity Measures ◽

Gene Clustering ◽

Biological Knowledge ◽

Multi Objective ◽

Gene Similarity

Using prior biological knowledge of gene relationships and functions to measure gene similarity, drawn from repositories such as the Gene Ontology (GO), has shown good results in multi-objective gene clustering algorithms. In this scenario, to obtain useful clustering results, it would be helpful to know which measure of biological similarity between genes should be employed to yield meaningful clusters that have both similar expression patterns (co-expression) and biological homogeneity. In this paper, we studied the influence of the four most used GO-based semantic similarity measures on the performance of a multi-objective gene clustering algorithm. We used four publicly available datasets and carried out comparative studies based on performance metrics from the multi-objective optimization field and clustering performance indexes. In most cases, the Jiang–Conrath and Wang similarities stand out in terms of multi-objective metrics. In clustering properties, Resnik similarity achieves the best values of compactness and separation, and therefore of co-expression of groups of genes. Meanwhile, in biological homogeneity, the Wang similarity reports a greater number of significant GO terms. However, statistical, visual, and biological significance tests showed that none of the GO-based semantic similarity measures stands out above the rest in significantly improving the performance of the multi-objective gene clustering algorithm.
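
As a rough illustration of two of the measures named in this abstract, the sketch below computes Resnik similarity and Jiang–Conrath distance on a tiny ontology. The ontology, the annotation counts, and all function names are hypothetical and are not taken from the paper; the Wang measure is not shown.

```python
import math

# Hypothetical toy ontology (child -> parents); these are not real GO terms.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "signaling": ["root"],
    "lipid_metabolism": ["metabolism"],
    "sugar_metabolism": ["metabolism"],
}

# Made-up numbers of genes annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0,
    "metabolism": 2,
    "signaling": 4,
    "lipid_metabolism": 1,
    "sugar_metabolism": 1,
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(total / n(t)), where n(t) counts annotations to t or its descendants.

    Assumes every term has at least one annotation at or below it.
    """
    cumulative = {term: 0 for term in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    total = cumulative["root"]
    return {term: math.log(total / n) for term, n in cumulative.items()}


IC = information_content()


def resnik(a, b):
    """IC of the most informative common ancestor of a and b."""
    return max(IC[t] for t in ancestors(a) & ancestors(b))


def jiang_conrath_distance(a, b):
    """IC(a) + IC(b) - 2 * IC(most informative common ancestor)."""
    return IC[a] + IC[b] - 2 * resnik(a, b)


print(resnik("lipid_metabolism", "sugar_metabolism"))
print(jiang_conrath_distance("lipid_metabolism", "sugar_metabolism"))
```

Resnik depends only on the shared ancestor, while Jiang–Conrath also accounts for how specific each term is, which is one reason the two can rank gene pairs differently.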


Semi-supervised Machine Learning Aided Anomaly Detection Method in Cellular Networks

10.36227/techrxiv.11634720 ◽

2020 ◽

Author(s):

Yutao Lu ◽

Juan Wang ◽

Miao Liu ◽

Kaixuan Zhang ◽

Guan Gui ◽

...

Keyword(s):

Machine Learning ◽

Anomaly Detection ◽

Cellular Networks ◽

Expert Knowledge ◽

Learning Algorithm ◽

Positive Sample ◽

Supervised Machine Learning ◽

Support Vector ◽

Soft Decision ◽

Decision Methods

The ever-increasing amount of data in cellular networks makes it challenging for network operators to monitor the quality of experience (QoE). Traditional hard decision methods based on key quality indicators (KQIs) struggle to perform QoE anomaly detection at big-data scale. To solve this problem, in this paper, we propose a KQIs-based QoE anomaly detection framework using a semi-supervised machine learning algorithm, i.e., iterative positive sample aided one-class support vector machine (IPS-OCSVM). The proposed method has four steps, of which the key step is combining machine learning with the network operator's expert knowledge using OCSVM. Our proposed IPS-OCSVM framework realizes QoE anomaly detection through soft decision and can easily fine-tune the anomaly detection ability on demand. Moreover, we prove that the fluctuation of KQIs thresholds based on expert knowledge has a limited impact on the result of anomaly detection. Finally, experimental results are given to confirm the proposed IPS-OCSVM framework for QoE anomaly detection in cellular networks.


Ecological interactions and the Netflix problem

PeerJ ◽

10.7717/peerj.3644 ◽

2017 ◽

Vol 5 ◽

pp. e3644 ◽

Cited By ~ 16

Author(s):

Philippe Desjardins-Proulx ◽

Idaline Laigle ◽

Timothée Poisot ◽

Dominique Gravel

Keyword(s):

Machine Learning ◽

Supervised Learning ◽

Random Forests ◽

Species Interactions ◽

Similarity Measures ◽

Theoretical Models ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Nearest Neighbour ◽

Ecological Interactions

Species interactions are a key component of ecosystems but we generally have an incomplete picture of who-eats-who in a given community. Different techniques have been devised to predict species interactions using theoretical models or abundances. Here, we explore the K nearest neighbour approach, with a special emphasis on recommendation, along with a supervised machine learning technique. Recommenders are algorithms developed for companies like Netflix to predict whether a customer will like a product given the preferences of similar customers. These machine learning techniques are well-suited to study binary ecological interactions since they focus on positive-only data. By removing a prey from a predator, we find that recommenders can guess the missing prey around 50% of the time on the first try, with up to 881 possibilities. Traits do not significantly improve the results for the K nearest neighbour, although a simple test with a supervised learning approach (random forests) shows we can predict interactions with high accuracy using only three traits per species. This result shows that binary interactions can be predicted without regard to the ecological community given only three variables: body mass and two variables for the species’ phylogeny. These techniques are complementary, as recommenders can predict interactions in the absence of traits, using only information about other species’ interactions, while supervised learning algorithms such as random forests base their predictions on traits only but do not exploit other species’ interactions. Further work should focus on developing custom similarity measures specialized for ecology to improve the KNN algorithms and using richer data to capture indirect relationships between species.
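
The leave-one-prey-out test described in this abstract can be sketched with a very small K nearest neighbour recommender. Everything below is an assumption made for illustration: the species, the diets, the use of Tanimoto similarity between prey sets, and the simple vote count are not taken from the paper.

```python
from collections import Counter

# Hypothetical predator -> prey sets; the data are made up.
DIETS = {
    "fox": {"rabbit", "mouse", "vole"},
    "hawk": {"rabbit", "mouse", "vole"},
    "owl": {"mouse", "vole", "shrew"},
    "heron": {"frog", "fish"},
}


def tanimoto(a, b):
    """Shared items divided by items in either set (0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(predator, diets, k=2):
    """Rank prey not yet recorded for `predator` by votes from its k most similar predators."""
    own = diets[predator]
    others = [p for p in diets if p != predator]
    neighbours = sorted(others, key=lambda p: tanimoto(own, diets[p]), reverse=True)[:k]
    votes = Counter()
    for neighbour in neighbours:
        for prey in diets[neighbour] - own:
            votes[prey] += 1
    return [prey for prey, _ in votes.most_common()]


# Hide one known prey of the fox and check whether it is recommended back.
held_out = dict(DIETS, fox=DIETS["fox"] - {"vole"})
suggestions = recommend("fox", held_out)
print(suggestions)
```

Note that the recommender uses only the interactions of other predators and no traits, which is the property the abstract contrasts with trait-based random forests.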


Identifying disease genes using machine learning and gene functional similarities, assessed through Gene Ontology

10.1101/472217 ◽

2018 ◽

Author(s):

Muhammad Asif ◽

Hugo F. M. C. M. Martiniano ◽

Astrid M. Vicente ◽

Francisco M. Couto

Keyword(s):

Machine Learning ◽

Gene Ontology ◽

Candidate Genes ◽

Semantic Similarity ◽

Quantitative Measure ◽

Complex Diseases ◽

Supervised Machine Learning ◽

Disease Genes ◽

Machine Learning Classifiers ◽

Learning Classifiers

Abstract Identifying disease genes from a vast amount of genetic data is one of the most challenging tasks in the post-genomic era. Also, complex diseases present highly heterogeneous genotypes, which makes biological marker identification difficult. Machine learning methods are widely used to identify these markers, but their performance is highly dependent upon the size and quality of available data. In this study, we demonstrated that machine learning classifiers trained on gene functional similarities, using Gene Ontology (GO), can improve the identification of genes involved in complex diseases. For this purpose, we developed a supervised machine learning methodology to predict complex disease genes. The proposed pipeline was assessed using Autism Spectrum Disorder (ASD) candidate genes. A quantitative measure of gene functional similarities was obtained by employing different semantic similarity measures. To infer the hidden functional similarities between ASD genes, various types of machine learning classifiers were built on quantitative semantic similarity matrices of ASD and non-ASD genes. The classifiers trained and tested on ASD and non-ASD gene functional similarities outperformed previously reported ASD classifiers. For example, a Random Forest (RF) classifier achieved an AUC of 0.80 for predicting new ASD genes, which was higher than the reported classifier (0.73). Additionally, this classifier was able to predict 73 novel ASD candidate genes that were enriched for core ASD phenotypes, such as autism and obsessive-compulsive behavior. In addition, predicted genes were also enriched for ASD co-occurring conditions, including Attention Deficit Hyperactivity Disorder (ADHD). We also developed a KNIME workflow with the proposed methodology which allows users to configure and execute it without requiring machine learning and programming skills.
Machine learning is an effective and reliable technique to decipher ASD mechanisms by identifying novel disease genes, but this study further demonstrated that its performance can be improved by incorporating a quantitative measure of gene functional similarities. Source code and the workflow of the proposed methodology are available at https://github.com/Muh-Asif/ASD-genes-prediction.


Semantic similarity and machine learning with ontologies

Briefings in Bioinformatics ◽

10.1093/bib/bbaa199 ◽

2020 ◽

Author(s):

Maxat Kulmanov ◽

Fatima Zohra Smaili ◽

Xin Gao ◽

Robert Hoehndorf

Keyword(s):

Machine Learning ◽

Semantic Similarity ◽

Domain Knowledge ◽

Life Sciences ◽

Similarity Measures ◽

Background Knowledge ◽

Biological Database ◽

Learning Models ◽

Machine Learning Methods ◽

Machine Learning Models

Abstract Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview of the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.


Predicting Inpatient Glucose Levels and Insulin Dosing by Machine Learning on Electronic Health Records

10.1101/2020.03.02.20029017 ◽

2020 ◽

Author(s):

Xiran Liu ◽

Ivana Jankovic ◽

Jonathan H Chen

Keyword(s):

Machine Learning ◽

Diabetes Management ◽

Expert Knowledge ◽

Supervised Machine Learning ◽

Health Records ◽

Large Variability ◽

Glucose Levels ◽

Machine Learning Methods ◽

Blood Glucose Levels ◽

Electronic Health

Abstract Poorly controlled glucose levels are associated with serious morbidity and mortality in hospitalized patients. Hospital diabetes management aims to maintain the glucose level within a desired range, primarily via insulin administration. Current inpatient glucose control relies significantly on expert knowledge, but this results in large variability and often suboptimal blood sugars in practice. We applied supervised machine learning methods to electronic health record (EHR) data to build predictive models that can inform inpatient insulin management. We found that individual blood glucose levels and insulin dosing are highly erratic and cannot be predicted precisely (MAE 28 mg/dL, R² 0.2). However, prescribing decisions can still be driven by the more reliable predictions of average daily glucose levels (MAE 21 mg/dL, R² 0.4) and whether any patient’s glucose levels will be higher than the clinically desired range in the next day (sens 0.73, spec 0.79).
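
For readers unfamiliar with the metrics quoted in this abstract, the sketch below shows how mean absolute error (MAE), the coefficient of determination (R²), sensitivity, and specificity are conventionally computed. The glucose values and flags are made up for illustration and have no connection to the study's data; the function names are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def sensitivity_specificity(actual, predicted):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fn = sum(a and not p for a, p in zip(actual, predicted))
    tn = sum(not a and not p for a, p in zip(actual, predicted))
    fp = sum(not a and p for a, p in zip(actual, predicted))
    return tp / (tp + fn), tn / (tn + fp)


# Made-up glucose readings (mg/dL) and model predictions.
observed = [100, 150, 200, 250]
predicted = [110, 140, 210, 240]
print(mean_absolute_error(observed, predicted))
print(r_squared(observed, predicted))

# Made-up flags for "glucose above the desired range on the next day".
actually_high = [True, True, False, False, True]
predicted_high = [True, False, False, True, True]
print(sensitivity_specificity(actually_high, predicted_high))
```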



Evolving knowledge graph similarity for supervised learning in complex biomedical domains

BMC Bioinformatics ◽

10.1186/s12859-019-3296-1 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 7

Author(s):

Rita T. Sousa ◽

Sara Silva ◽

Catia Pesquita

Keyword(s):

Supervised Learning ◽

Semantic Similarity ◽

Protein Interaction ◽

Expert Knowledge ◽

Learning Task ◽

Knowledge Graph ◽

Interaction Prediction ◽

Protein Protein Interaction ◽

Protein Interaction Prediction ◽

Knowledge Graphs

Abstract Background: In recent years, biomedical ontologies have become important for describing existing biological knowledge in the form of knowledge graphs. Data mining approaches that work with knowledge graphs have been proposed, but they are based on vector representations that do not capture the full underlying semantics. An alternative is to use machine learning approaches that explore semantic similarity. However, since ontologies can model multiple perspectives, semantic similarity computations for a given learning task need to be fine-tuned to account for this. Obtaining the best combination of semantic similarity aspects for each learning task is not trivial and typically depends on expert knowledge. Results: We have developed a novel approach, evoKGsim, that applies Genetic Programming over a set of semantic similarity features, each based on a semantic aspect of the data, to obtain the best combination for a given supervised learning task. The approach was evaluated on several benchmark datasets for protein-protein interaction prediction using the Gene Ontology as the knowledge graph to support semantic similarity, and it outperformed competing strategies, including manually selected combinations of semantic aspects emulating expert knowledge. evoKGsim was also able to learn species-agnostic models with different combinations of species for training and testing, effectively addressing the limitations of predicting protein-protein interactions for species with fewer known interactions. Conclusions: evoKGsim can overcome one of the limitations in knowledge graph-based semantic similarity applications: the need to expertly select which aspects should be taken into account for a given application. Applying this methodology to protein-protein interaction prediction proved successful, paving the way to broader applications.


Umpire 2.0: Simulating realistic, mixed-type, clinical data for machine learning

F1000Research ◽

10.12688/f1000research.25877.1 ◽

2020 ◽

Vol 9 ◽

pp. 1186

Author(s):

Caitlin E. Coombes ◽

Zachary B. Abrams ◽

Samantha Nakayiza ◽

Guy Brock ◽

Kevin R. Coombes

Keyword(s):

Machine Learning ◽

Clinical Data ◽

Mixed Type ◽

R Package ◽

Fine Tuning ◽

Supervised Machine Learning ◽

Mixed Data ◽

Operating Characteristics ◽

Data Types ◽

Type Data

The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, clinical data has posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses challenges to simulating realistic clinical data by providing the user a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate operating characteristics of an algorithm in both supervised and unsupervised ML.


Evaluation of taxonomic and neural embedding methods for calculating semantic similarity

Natural Language Engineering ◽

10.1017/s1351324921000279 ◽

2021 ◽

pp. 1-29

Author(s):

Dongqiang Yang ◽

Yanqin Yin

Keyword(s):

Semantic Similarity ◽

Word Frequency ◽

Similarity Measures ◽

Semantic Networks ◽

Knowledge Bases ◽

Fine Tuning ◽

Language Models ◽

Word Similarity ◽

Uniform Distance ◽

Measure Word

Abstract Modelling semantic similarity plays a fundamental role in lexical semantic applications. A natural way of calculating semantic similarity is to access handcrafted semantic networks, but similarity prediction can also be anticipated in a distributional vector space. Similarity calculation continues to be a challenging task, even with the latest breakthroughs in deep neural language models. We first examined popular methodologies in measuring taxonomic similarity, including edge-counting that solely employs semantic relations in a taxonomy, as well as the complex methods that estimate concept specificity. We further extrapolated three weighting factors in modelling taxonomic similarity. To study the distinct mechanisms between taxonomic and distributional similarity measures, we ran head-to-head comparisons of each measure with human similarity judgements from the perspectives of word frequency, polysemy degree and similarity intensity. Our findings suggest that without fine-tuning the uniform distance, taxonomic similarity measures can depend on the shortest path length as a prime factor to predict semantic similarity; in contrast to distributional semantics, edge-counting is free from sense distribution bias in use and can measure word similarity both literally and metaphorically; the synergy of retrofitting neural embeddings with concept relations in similarity prediction may indicate a new trend to leverage knowledge bases on transfer learning. It appears that a large gap still exists on computing semantic similarity among different ranges of word frequency, polysemous degree and similarity intensity.
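
Edge-counting, which this abstract describes as relying only on the relations in a taxonomy, can be sketched as a shortest-path search. The taxonomy below is a hypothetical toy example, and the conversion 1 / (1 + path length) is one common convention assumed here, not necessarily the formula the authors evaluated.

```python
from collections import deque

# Hypothetical is-a taxonomy (child -> parent); not taken from any real lexicon.
IS_A = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def build_graph(is_a):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in is_a.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(graph, a, b):
    """Assumed convention: 1 / (1 + shortest path length), 0 if no path exists."""
    length = shortest_path_length(graph, a, b)
    return 0.0 if length is None else 1 / (1 + length)


GRAPH = build_graph(IS_A)
print(path_similarity(GRAPH, "dog", "wolf"))
print(path_similarity(GRAPH, "dog", "cat"))
```

Because every edge counts equally, this is the "uniform distance" setting; the concept-specificity methods mentioned in the abstract weight edges or nodes differently.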
