ProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity

Author(s):  
Otmar Ertl
2009 ◽  
Vol 03 (02) ◽  
pp. 209-234 ◽  
Author(s):  
YI YU ◽  
KAZUKI JOE ◽  
VINCENT ORIA ◽  
FABIAN MOERCHEN ◽  
J. STEPHEN DOWNIE ◽  
...  

Research on audio-based music retrieval has primarily concentrated on refining audio features to improve search quality. However, much less work has been done on improving the time efficiency of music audio searches. Representing music audio documents in an indexable format provides a mechanism for achieving efficiency. To address this issue, in this work Exact Locality Sensitive Mapping (ELSM) is suggested to join the concatenated feature sets and soft hash values. On this basis we propose audio-based music indexing techniques, ELSM and Soft Locality Sensitive Hash (SoftLSH) using an optimized Feature Union (FU) set of extracted audio features. Two contributions are made here. First, the principle of similarity-invariance is applied in summarizing audio feature sequences and utilized in training semantic audio representations based on regression. Second, soft hash values are pre-calculated to help locate the searching range more accurately and improve collision probability among features similar to each other. Our algorithms are implemented in a demonstration system to show how to retrieve and evaluate multi-version audio documents. Experimental evaluation over a real "multi-version" audio dataset confirms the practicality of ELSM and SoftLSH with FU and proves that our algorithms are effective for both multi-version detection (online query, one-query vs. multi-object) and same content detection (batch queries, multi-queries vs. one-object).


2005 ◽  
Vol 40 (10) ◽  
pp. 975-980 ◽  
Author(s):  
Maria Imaculada Zucchi ◽  
José Baldin Pinheiro ◽  
Lázaro José Chaves ◽  
Alexandre Siqueira Guedes Coelho ◽  
Mansuêmia Alves Couto ◽  
...  

This study was carried out to assess the genetic variability of ten "cagaita" tree (Eugenia dysenterica) populations in Southeastern Goiás. Fifty-four randomly amplified polymorphic DNA (RAPD) loci were used to characterize the population genetic variability, using the analysis of molecular variance (AMOVA). A phiST value of 0.2703 was obtained, showing that 27.03% and 72.97% of the genetic variability is present among and within populations, respectively. The Pearson correlation coefficient (r) between the genetic distance matrix (1 - Jaccard similarity index) and the geographic distances was estimated, and a strong positive correlation was detected. The results suggest that these populations are differentiating through a stochastic process, with restricted gene flow that depends on geographic distribution.
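The distance-correlation step described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' analysis: the population names, band profiles, and distances in kilometres are hypothetical, and the helper functions `jaccard_distance` and `pearson` are introduced here for the example only.

```python
from itertools import combinations
from math import sqrt


def jaccard_distance(a, b):
    """Return 1 - Jaccard similarity for two equal-length 0/1 band profiles."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - shared / either


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Hypothetical presence (1) / absence (0) of six marker bands per population.
profiles = {
    "pop_A": [1, 1, 0, 1, 0, 1],
    "pop_B": [1, 1, 0, 1, 1, 1],
    "pop_C": [0, 1, 1, 0, 1, 1],
}
# Hypothetical pairwise geographic distances in kilometres.
geo_km = {
    ("pop_A", "pop_B"): 10.0,
    ("pop_A", "pop_C"): 80.0,
    ("pop_B", "pop_C"): 70.0,
}

pairs = list(combinations(sorted(profiles), 2))
genetic = [jaccard_distance(profiles[p], profiles[q]) for p, q in pairs]
geographic = [geo_km[(p, q)] for p, q in pairs]
r = pearson(genetic, geographic)
print(f"r = {r:.3f}")
```

Because pairwise distances are not independent observations, the significance of such a correlation is usually assessed with a permutation procedure such as a Mantel test rather than the ordinary test for r; the abstract does not say which procedure was used.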


2020 ◽  
Author(s):  
Muhammad Afzal ◽  
Fakhare Alam ◽  
Khalid Mahmood Malik ◽  
Ghaus M Malik

BACKGROUND Automatic text summarization (ATS) enables users to retrieve meaningful evidence from big data of biomedical repositories to make complex clinical decisions. Deep neural and recurrent networks outperform traditional machine-learning techniques in areas of natural language processing and computer vision; however, they are yet to be explored in the ATS domain, particularly for medical text summarization. OBJECTIVE Traditional approaches in ATS for biomedical text suffer from fundamental issues such as an inability to capture clinical context, quality of evidence, and purpose-driven selection of passages for the summary. We aimed to circumvent these limitations through achieving precise, succinct, and coherent information extraction from credible published biomedical resources, and to construct a simplified summary containing the most informative content that can offer a review particular to clinical needs. METHODS In our proposed approach, we introduce a novel framework, termed Biomed-Summarizer, that provides quality-aware Patient/Problem, Intervention, Comparison, and Outcome (PICO)-based intelligent and context-enabled summarization of biomedical text. Biomed-Summarizer integrates the prognosis quality recognition model with a clinical context–aware model to locate text sequences in the body of a biomedical article for use in the final summary. First, we developed a deep neural network binary classifier for quality recognition to acquire scientifically sound studies and filter out others. Second, we developed a bidirectional long-short term memory recurrent neural network as a clinical context–aware classifier, which was trained on semantically enriched features generated using a word-embedding tokenizer for identification of meaningful sentences representing PICO text sequences. 
Third, we calculated the similarity between query and PICO text sequences using Jaccard similarity with semantic enrichments, where the semantic enrichments are obtained using medical ontologies. Last, we generated a representative summary from the high-scoring PICO sequences aggregated by study type, publication credibility, and freshness score. RESULTS Evaluation of the prognosis quality recognition model using a large dataset of biomedical literature related to intracranial aneurysm showed an accuracy of 95.41% (2562/2686) in terms of recognizing quality articles. The clinical context–aware multiclass classifier outperformed the traditional machine-learning algorithms, including support vector machine, gradient boosted tree, linear regression, K-nearest neighbor, and naïve Bayes, by achieving 93% (16127/17341) accuracy for classifying five categories: aim, population, intervention, results, and outcome. The semantic similarity algorithm achieved a significant Pearson correlation coefficient of 0.61 (0-1 scale) on a well-known BIOSSES dataset (with 100 pair sentences) after semantic enrichment, representing an improvement of 8.9% over baseline Jaccard similarity. Finally, we found a highly positive correlation among the evaluations performed by three domain experts concerning different metrics, suggesting that the automated summarization is satisfactory. CONCLUSIONS By employing the proposed method Biomed-Summarizer, high accuracy in ATS was achieved, enabling seamless curation of research evidence from the biomedical literature to use for clinical decision-making.
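The third step, Jaccard similarity with semantic enrichment, can be illustrated with a small sketch. It assumes that enrichment means adding ontology concepts to each token set before comparison; the `ONTOLOGY` mapping, the concept labels, and the `enrich` helper are hypothetical stand-ins, not the Biomed-Summarizer implementation or a real medical ontology.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical term-to-concept mapping standing in for a medical ontology.
ONTOLOGY = {
    "brain": {"CONCEPT_HEAD"},
    "intracranial": {"CONCEPT_HEAD"},
    "rupture": {"CONCEPT_HEMORRHAGE"},
    "bleeding": {"CONCEPT_HEMORRHAGE"},
}


def enrich(tokens, ontology):
    """Return the tokens plus every concept the ontology links them to."""
    enriched = set(tokens)
    for token in tokens:
        enriched |= ontology.get(token, set())
    return enriched


query = ["brain", "aneurysm", "rupture"]
sentence = ["intracranial", "aneurysm", "bleeding"]

baseline = jaccard(query, sentence)
enriched = jaccard(enrich(query, ONTOLOGY), enrich(sentence, ONTOLOGY))
print(f"baseline = {baseline:.3f}, enriched = {enriched:.3f}")
```

In this toy case the two texts share only one literal token, so the baseline score is low; mapping synonymous terms to shared concepts raises the overlap, which is the kind of effect the reported improvement over baseline Jaccard similarity suggests.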


2018 ◽  
Vol 77 (22) ◽  
pp. 29435-29455
Author(s):  
Yuhua Jia ◽  
Liang Bai ◽  
Peng Wang ◽  
Jinlin Guo ◽  
Yuxiang Xie ◽  
...  

Paleobiology ◽  
2021 ◽  
pp. 1-18
Author(s):  
Daniel G. Dick ◽  
Marc Laflamme

Abstract Classic similarity indices measure community resemblance in terms of incidence (the number of shared species) and abundance (the extent to which the shared species are an equivalently large component of the ecosystem). Here we describe a general method for increasing the amount of information contained in the output of these indices and describe a new “soft” ecological similarity measure (here called “soft Chao-Jaccard similarity”). The new measure quantifies community resemblance in terms of shared species, while accounting for intraspecific variation in abundance and morphology between samples. We demonstrate how our proposed measure can reconstruct short ecological gradients using random samples of taxa, recognizing patterns that are completely missed by classic measures of similarity. To demonstrate the utility of our new index, we reconstruct a morphological gradient driven by river flow velocity using random samples drawn from simulated and real-world data. Results suggest that the new index can be used to recognize complex short ecological gradients in settings where only information about specimens is available. We include open-source R code for calculating the proposed index.


Homeopathy ◽  
2021 ◽  
Author(s):  
Kurian Poruthukaren

Abstract Background The critical task of researchers conducting double-blinded, randomized, placebo-controlled homeopathic pathogenetic trials is to segregate the signals from the noises. The noises are signs and symptoms due to factors other than the trial drug; signals are signs and symptoms due to the trial drug. Unfortunately, the existing tools (criteria for a causal association of symptoms only with the tested medicine, qualitative pathogenetic index, quantitative pathogenetic index, pathogenic index) have limitations in analyzing the symptoms of the placebo group as a comparator, resulting in inadequate segregation of the noises. Hence, the Jaccard similarity index and the Noise index are proposed for analyzing the symptoms of the placebo group as a comparator. Methods The Jaccard similarity index is the ratio of the number of common elements among the placebo and intervention groups to the aggregated number of elements in these groups. The Noise index is the ratio of common elements among the placebo and intervention group to the total elements of the intervention group. Homeopathic pathogenetic trials of Plumbum metallicum, Piper methysticum and Hepatitis C nosode were selected for experimenting with the computation of the Jaccard similarity index and the Noise index. Results Jaccard similarity index calculations show that 8% of Plumbum metallicum's elements, 10.7% of Piper methysticum's elements, and 19.3% of Hepatitis C nosode's elements were similar to the placebo group when elements of both the groups (intervention and placebo) were aggregated. Noise index calculations show that 10.7% of Plumbum metallicum's elements, 13.9% of Piper methysticum's elements and 25.7% of Hepatitis C nosode's elements were similar to those of the placebo group. 
Conclusion The Jaccard similarity index and the Noise index might be considered an additional approach for analyzing the symptoms of the placebo group as a comparator, resulting in better noise segregation in homeopathic pathogenetic trials.
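The two ratios defined in the Methods can be sketched as follows. The sketch assumes that the "aggregated number of elements" is the size of the union of the two symptom sets, as in the standard Jaccard index; the symptom lists are hypothetical and are not data from the cited trials.

```python
def jaccard_index(intervention, placebo):
    """Common elements divided by all distinct elements of both groups."""
    union = intervention | placebo
    return len(intervention & placebo) / len(union) if union else 0.0


def noise_index(intervention, placebo):
    """Common elements divided by the elements of the intervention group."""
    if not intervention:
        return 0.0
    return len(intervention & placebo) / len(intervention)


# Hypothetical symptom sets reported in each arm of a trial.
intervention = {"headache", "nausea", "fatigue", "dry mouth"}
placebo = {"headache", "insomnia"}

print(f"Jaccard index: {jaccard_index(intervention, placebo):.1%}")
print(f"Noise index:   {noise_index(intervention, placebo):.1%}")
```

Since the intervention group is a subset of the union, the Noise index is never smaller than the Jaccard index for the same pair of groups, which matches the pattern in the reported percentages.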

