An analysis and comparison of the statistical sensitivity of semantic similarity metrics

2018 ◽  
Author(s):  
Prashanti Manda ◽  
Todd Vision

Semantic similarity has been used for comparing genes, proteins, phenotypes, diseases, etc. for various biological applications. The rise of ontology-based data representation in biology has also led to the development of several semantic similarity metrics that use different statistics to estimate similarity. Although semantic similarity has become a crucial computational tool in several applications, there has not been a formal evaluation of the statistical sensitivity of these metrics and their ability to recognize similarity between distantly related biological objects. Here, we present a statistical sensitivity comparison of five semantic similarity metrics (Jaccard, Resnik, Lin, Jiang & Conrath, and Hybrid Relative Specificity Similarity) representing three different kinds of metrics (Edge based, Node based, and Hybrid) and explore key parameter choices that can impact sensitivity. Furthermore, we compare four methods of aggregating individual annotation similarities to estimate similarity between two biological objects: All Pairs, Best Pairs, Best Pairs Symmetric, and Groupwise. To evaluate sensitivity in a controlled fashion, we explore two different models for simulating data with varying levels of similarity and compare to the noise distribution using resampling. Source data are derived from the Phenoscape Knowledgebase of evolutionary phenotypes. Our results indicate that the choice of similarity metric along with different parameter choices can substantially affect sensitivity. Among the five metrics evaluated, we find that Resnik similarity shows the greatest sensitivity to weak semantic similarity. Among the ways to combine pairwise statistics, the Groupwise approach provides the greatest discrimination among values above the sensitivity threshold, while the Best Pairs statistic can be parametrically tuned to provide the highest sensitivity. Our findings serve as a guideline for an appropriate choice and parameterization of semantic similarity metrics, and point to the need for improved reporting of the statistical significance of semantic similarity matches in cases where weak similarity is of interest.
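As an illustration of the kinds of metrics and aggregation methods named in this abstract, the following is a minimal sketch, not the authors' implementation. The toy ontology, the annotation counts, and all function names are hypothetical. Jaccard is computed over ancestor (subsumer) sets, and Resnik as the information content of the most informative common ancestor. Using the mean to summarize pairwise scores is also an assumption made here; the abstract does not specify the summary statistic.

```python
import math

# Hypothetical toy ontology (child -> parents); not taken from Phenoscape.
PARENTS = {
    "root": set(),
    "fin": {"root"},
    "bone": {"root"},
    "pectoral_fin": {"fin"},
    "pelvic_fin": {"fin"},
    "fin_ray": {"fin", "bone"},
}

# Hypothetical number of annotations made directly to each term.
DIRECT_COUNTS = {"root": 0, "fin": 2, "bone": 2,
                 "pectoral_fin": 3, "pelvic_fin": 3, "fin_ray": 2}
TOTAL = sum(DIRECT_COUNTS.values())


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def jaccard(a, b):
    """Shared ancestors divided by all ancestors of either term."""
    sa, sb = ancestors(a), ancestors(b)
    return len(sa & sb) / len(sa | sb)


def information_content(term):
    """-log of the fraction of annotations at or below the term."""
    covered = sum(c for t, c in DIRECT_COUNTS.items() if term in ancestors(t))
    return -math.log(covered / TOTAL)


def resnik(a, b):
    """Information content of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(t) for t in common)


def all_pairs(profile_a, profile_b, sim):
    """Mean similarity over every cross-profile pair of annotations."""
    scores = [sim(a, b) for a in profile_a for b in profile_b]
    return sum(scores) / len(scores)


def best_pairs(profile_a, profile_b, sim):
    """Mean of each annotation in A's best match in B (asymmetric)."""
    best = [max(sim(a, b) for b in profile_b) for a in profile_a]
    return sum(best) / len(best)


def best_pairs_symmetric(profile_a, profile_b, sim):
    """Average of the two directed Best Pairs scores."""
    return 0.5 * (best_pairs(profile_a, profile_b, sim)
                  + best_pairs(profile_b, profile_a, sim))


def groupwise_jaccard(profile_a, profile_b):
    """Jaccard over the pooled ancestor sets of the two profiles."""
    sa = set().union(*(ancestors(t) for t in profile_a))
    sb = set().union(*(ancestors(t) for t in profile_b))
    return len(sa & sb) / len(sa | sb)


query = ["pectoral_fin"]
subject = ["pelvic_fin", "fin_ray"]
print(best_pairs(query, subject, jaccard))
print(best_pairs(query, subject, resnik))
print(groupwise_jaccard(query, subject))
```

In this sketch the pairwise methods first score individual annotation pairs and then summarize them, whereas the groupwise method compares the two profiles as pooled sets in a single step.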

2016 ◽  
Author(s):  
Prashanti Manda ◽  
James P Balhoff ◽  
Todd J Vision

In phenotype annotations curated from the biological and medical literature, considerable human effort must be invested to select ontological classes that capture the expressivity of the original natural language descriptions, and finer annotation granularity can also entail higher computational costs for particular reasoning tasks. Do coarse annotations suffice for certain applications? Here, we measure how annotation granularity affects the statistical behavior of semantic similarity metrics. We use a randomized dataset of phenotype profiles drawn from 57,051 taxon-phenotype annotations in the Phenoscape Knowledgebase. We compared query profiles having variable proportions of matching phenotypes to subject database profiles using both pairwise and groupwise Jaccard (edge-based) and Resnik (node-based) semantic similarity metrics, and compared statistical performance for three different levels of annotation granularity: entities alone, entities plus attributes, and entities plus qualities (with implicit attributes). All four metrics examined showed more extreme values than expected by chance when approximately half the annotations matched between the query and subject profiles, with a more sudden decline for pairwise statistics and a more gradual one for the groupwise statistics. Annotation granularity had a negligible effect on the position of the threshold at which matches could be discriminated from noise. These results suggest that coarse annotations of phenotypes, at the level of entities with or without attributes, may be sufficient to identify phenotype profiles with statistically significant semantic similarity.


2021 ◽  
Vol 24 ◽  
pp. 256-266
Author(s):  
Nihayatul Karimah ◽  
Gijs Schaftenaar

Purpose: Structurally similar molecules are likely to have similar biological activity. In this study, similarity searching based on molecular 2D fingerprints was performed to analyze off-target effects of drugs. The purpose of this study is to determine the correlation between the adverse effects and drug off-targets. Methods: A workflow was built using KNIME to prepare a dataset of twenty-nine targets from ChEMBL, generate molecular 2D fingerprints of the ligands, calculate the similarity between ligand sets, and compute the statistical significance using the similarity ensemble approach (SEA). Tanimoto coefficients (Tc) are used as a measure of chemical similarity; values between 0.2 and 0.4 are the most common among ligand pairs and are considered to indicate insignificant similarity. Results: The majority of ligand sets are unrelated, as is evidenced by the intrinsic chemical differences and the classification of statistical significance based on expectation value. The rank-ordered expectation value of inter-target similarity showed a correlation with off-target effects of the known drugs. Conclusion: Similarity searching using molecular 2D fingerprints can be applied to predict off-targets and correlate them to the adverse effects of the drugs. KNIME, as an open-source data analytics platform, is applicable for building a workflow for data mining of the ChEMBL database and generating the SEA statistical model.
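The Tanimoto coefficient mentioned above can be sketched as follows. This is a minimal illustration, not the KNIME workflow or the SEA model from the study. Fingerprints are represented as sets of on-bit indices, and the example fingerprints and the function name are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints given as indices of bits that are set.
ligand_1 = {1, 2, 3, 4}
ligand_2 = {3, 4, 5, 6}

tc = tanimoto(ligand_1, ligand_2)
print(round(tc, 3))  # 2 shared bits out of 6 distinct bits
```

With these example inputs the coefficient is 2/6, which falls in the 0.2 to 0.4 range that the abstract describes as typical of ligand pairs without significant similarity.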


Author(s):  
Silvia Likavec ◽  
Francesco Osborne ◽  
Federica Cena

The authors introduce new measures of semantic similarity and relatedness for ontological concepts, based on the properties associated with them. They consider two concepts similar if, for some properties they have in common, they also have the same values assigned to these properties. On the other hand, the authors consider two concepts related if they have the same values assigned to different properties. These measures are used in the propagation of user interest values in ontology-based user models to other similar or related concepts in the domain. The authors tested their algorithm in the event recommendation and recipe domains and showed that property-based propagation based on similarity outperforms the standard edge-based propagation. Adding relatedness as a criterion for propagation improves diversity without sacrificing accuracy. In addition, assigning a certain relevance to each property improves the accuracy of recommendation. Finally, the property-based spreading activation is effective for cross-domain recommendation.


2018 ◽  
Vol 14 (4) ◽  
pp. 39-54 ◽  
Author(s):  
Tan Li Im ◽  
Phang Wai San ◽  
Patricia Anthony ◽  
Chin Kim On

This article discusses polarity classification for financial news articles. The proposed Semantic Sentiment Analyser (SSA) makes use of semantic similarity techniques, sentiment composition rules, and the Positivity/Negativity (P/N) ratio in performing polarity classification. An experiment was conducted to compare the performance of three semantic similarity metrics, namely HSO, LESK, and LIN, in finding words semantically similar to the input word. The best similarity technique (HSO) is incorporated into the sentiment analyser to find the possible polarity carrier from the analysed text before performing polarity classification. The performance of the proposed SSA was evaluated using a set of manually annotated financial news articles. The results of the experiment showed that the proposed SSA achieved an F-score of 90.89% for classification across all cases.


Author(s):  
A. S. Aksenov

This paper considers the creation of a software tool for analyzing data received via various data transfer interfaces. The analysis helps to check that the informational and control messages of the various components of a system under development are correctly structured, to verify the correctness of network interaction, and to test, debug, and adjust software and hardware with respect to their information compatibility. A comparison with existing network activity analysis tools is presented, and several approaches to data analysis at the application level are compared. The article justifies the choice of a unified srcML format for source data representation. It also specifies directions for further development of the analysis program within the present project. The conclusion discusses the expediency of developing this software tool and the results of its application in the development of a special-purpose hardware and software suite.


Author(s):  
S. Cheng ◽  
M. Dou ◽  
J. Wang ◽  
S. Zhang ◽  
X. Chen

For an irrigation area that is often complicated by various 3D artificial ground features and the natural environment, the disadvantages of traditional 2D GIS in spatial data representation, management, query, analysis and visualization are becoming more and more evident. Building a more realistic 3D virtual scene is thus especially urgent for irrigation area managers and decision makers, so that they can carry out various irrigation operations vividly and intuitively. Building on previous researchers' achievements, a simple, practical and cost-effective approach was proposed in this study, adopting 3D geographic information system (3D GIS) and remote sensing (RS) technology. Based on multi-source data such as Google Earth (GE) high-resolution remote sensing images, ASTER G-DEM, hydrological facility maps and so on, a 3D terrain model and ground feature models were created interactively. Both kinds of models were then rendered with texture data and integrated under the ArcGIS platform. A vivid, realistic 3D virtual scene of the irrigation area that has a good visual effect and possesses primary GIS functions for data query and analysis was constructed. Yet there is still a long way to go in establishing a true 3D GIS for the irrigation area; the issues of this study are discussed in depth and future research directions are pointed out at the end of the paper.


Author(s):  
Maike Knoechelmann ◽  
Garth Davies ◽  
Logan Macnair

Prominent terrorism case studies of individuals such as Omar Mateen, Dylann Roof, and Mohammed Merah indicate the need for personality trait-based terrorism risk assessment/threat assessment (TR/TA). This chapter provides an overview of Corrado’s personality-based TR/TA instrument (see Chapter 14) by explaining the origin of each domain and the purpose of its inclusion. Furthermore, this chapter presents results from a preliminary instrument validation study conducted on an open-source sample of 158 terrorists. Results of this study suggest strong statistical significance for many of the domains. This suggests the need for future inclusion of personality-based indicators in terrorism risk assessment.

