Differentially Private Sketches for Jaccard Similarity Estimation

Locally private Jaccard similarity estimation

Concurrency and Computation: Practice and Experience ◽

10.1002/cpe.4889 ◽

2018 ◽

Vol 31 (24) ◽

Cited By ~ 1

Author(s):

Ziqi Yan ◽

Qiong Wu ◽

Meng Ren ◽

Jiqiang Liu ◽

Shaowu Liu ◽

...

Keyword(s):

Jaccard Similarity ◽

Similarity Estimation


A Survey of Network Embedding for Drug Analysis and Prediction

Current Protein and Peptide Science ◽

10.2174/1389203721666200702145701 ◽

2020 ◽

Vol 21 ◽

Author(s):

Zhixian Liu ◽

Qingfeng Chen ◽

Wei Lan ◽

Jiahai Liang ◽

Yiping Phoebe Chen ◽

...

Keyword(s):

Deep Learning ◽

Protein Function ◽

Dimensional Space ◽

Auxiliary Information ◽

Matrix Decomposition ◽

Drug Analysis ◽

Machine Learning Algorithms ◽

Superior Performance ◽

Network Embedding ◽

Similarity Estimation

Traditional network-based computational methods have shown good results in drug analysis and prediction. However, these methods are time-consuming and lack universality, and they make it difficult to exploit the auxiliary information of nodes and edges. Network embedding offers a promising way to alleviate these problems by transforming a network into a low-dimensional space while preserving network structure and auxiliary information, which facilitates the application of machine learning algorithms in subsequent processing. Network embedding has been introduced into drug analysis and prediction in the last few years and has shown superior performance over traditional methods. However, there is no systematic review of this topic. This article offers a comprehensive survey of the primary network embedding methods and their applications in drug analysis and prediction. The network embedding technologies applied to homogeneous and heterogeneous networks are investigated and compared, including matrix decomposition, random walk, and deep learning. In particular, the graph neural network (GNN) methods in deep learning are highlighted. Further, the applications of network embedding in drug similarity estimation, drug-target interaction prediction, adverse drug reaction prediction, and protein function and therapeutic peptide prediction are discussed. Several potential future research directions are also discussed.


Patent relatedness and velocity in the Chinese pharmaceutical industry: A dataset of Jaccard similarity indices

Data in Brief ◽

10.1016/j.dib.2021.106814 ◽

2021 ◽

Vol 35 ◽

pp. 106814

Author(s):

Charlotte Marie Vorreuther ◽

Thierry Warin

Keyword(s):

Pharmaceutical Industry ◽

Jaccard Similarity ◽

Similarity Indices


Sentence similarity evaluation using Sent2Vec and siamese neural network with parallel structure

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189593 ◽

2021 ◽

pp. 1-10

Author(s):

Hye-Jeong Song ◽

Tak-Sung Heo ◽

Jong-Dae Kim ◽

Chan-Young Park ◽

Yu-Seop Kim

Keyword(s):

Neural Network ◽

Language Processing ◽

Short Term Memory ◽

Parallel Structure ◽

Short Term ◽

Similarity Estimation ◽

Accurate Judgment ◽

Proposed Model ◽

Sentence Similarity ◽

Long Short Term Memory

Sentence similarity evaluation is a significant task used in machine translation, classification, and information extraction in the field of natural language processing. Given two sentences, an accurate judgment should be made as to whether their meanings are equivalent even if their words and contexts differ. To this end, existing studies have measured the similarity of sentences by focusing on the analysis of words, morphemes, and letters. To measure sentence similarity, this study uses Sent2Vec, a sentence embedding, as well as morpheme word embedding. Vectors representing words are input to a one-dimensional convolutional neural network (1D-CNN) with various kernel sizes and to a bidirectional long short-term memory (Bi-LSTM) network. Self-attention is applied to the features transformed through the Bi-LSTM. Subsequently, the vectors produced by the 1D-CNN and by self-attention are converted through global max pooling and global average pooling, respectively, to extract specific values. The vectors generated through the above process are concatenated with the vector generated through Sent2Vec and are represented as a single vector. This vector is input to a softmax layer, and finally, the similarity between the two sentences is determined. The proposed model can improve accuracy by up to 5.42 percentage points compared with conventional sentence similarity estimation models.


The Combined Method of Semantic Similarity Estimation of Problem Oriented Knowledge on the Basis of Evolutionary Procedures

Advances in Intelligent Systems and Computing - Artificial Intelligence Trends in Intelligent Systems ◽

10.1007/978-3-319-57261-1_8 ◽

2017 ◽

pp. 74-83 ◽

Cited By ~ 2

Author(s):

V. V. Bova ◽

E. V. Nuzhnov ◽

V. V. Kureichik

Keyword(s):

Semantic Similarity ◽

Combined Method ◽

Similarity Estimation


Musical perceptual similarity estimation using interactive genetic algorithm

IEEE Congress on Evolutionary Computation ◽

10.1109/cec.2010.5586527 ◽

2010 ◽

Cited By ~ 3

Author(s):

Shangfei Wang ◽

Hua Zhu

Keyword(s):

Genetic Algorithm ◽

Perceptual Similarity ◽

Interactive Genetic Algorithm ◽

Similarity Estimation


Genetic structure and gene flow of Eugenia dysenterica natural populations

Pesquisa Agropecuária Brasileira ◽

10.1590/s0100-204x2005001000005 ◽

2005 ◽

Vol 40 (10) ◽

pp. 975-980 ◽

Cited By ~ 22

Author(s):

Maria Imaculada Zucchi ◽

José Baldin Pinheiro ◽

Lázaro José Chaves ◽

Alexandre Siqueira Guedes Coelho ◽

Mansuêmia Alves Couto ◽

...

Keyword(s):

Gene Flow ◽

Genetic Variability ◽

Pearson Correlation ◽

Natural Populations ◽

Similarity Index ◽

Genetic Distances ◽

Strong Positive Correlation ◽

Randomly Amplified Polymorphic Dna ◽

Jaccard Similarity ◽

Eugenia Dysenterica

This study was carried out to assess the genetic variability of ten "cagaita" tree (Eugenia dysenterica) populations in Southeastern Goiás. Fifty-four randomly amplified polymorphic DNA (RAPD) loci were used to characterize the population genetic variability, using the analysis of molecular variance (AMOVA). A phiST value of 0.2703 was obtained, showing that 27.03% and 72.97% of the genetic variability is present among and within populations, respectively. The Pearson correlation coefficient (r) between the genetic distance matrix (1 - Jaccard similarity index) and the geographic distances was estimated, and a strong positive correlation was detected. Results suggest that these populations are differentiating through a stochastic process, with restricted gene flow that depends on geographic distribution.
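The distance-versus-geography comparison described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the band sets, the population names, and the one-dimensional positions are invented for illustration, and the helper functions `jaccard_similarity` and `pearson` are hypothetical names introduced here.

```python
from itertools import combinations
from math import sqrt


def jaccard_similarity(a, b):
    """Jaccard index of two sets: |A ∩ B| / |A ∪ B| (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)


# Assumed toy data: marker bands scored as present in each population,
# and made-up positions (km) along a single transect.
bands = {
    "P1": {1, 2, 3, 4},
    "P2": {1, 2, 3, 5},
    "P3": {1, 2, 6, 7},
    "P4": {1, 8, 9, 10},
}
position_km = {"P1": 0, "P2": 10, "P3": 30, "P4": 70}

pairs = list(combinations(bands, 2))
# Genetic distance is taken as 1 - Jaccard similarity, as in the abstract.
genetic = [1 - jaccard_similarity(bands[p], bands[q]) for p, q in pairs]
geographic = [abs(position_km[p] - position_km[q]) for p, q in pairs]

r = pearson(genetic, geographic)
print(f"Pearson r between genetic and geographic distance: {r:.3f}")
```

Because pairwise distances are not independent observations, a significance test on such a correlation usually needs a permutation-based procedure rather than the ordinary Pearson test; the sketch only computes the coefficient.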


Adapting Gloss Vector Semantic Relatedness Measure for Semantic Similarity Estimation: An Evaluation in the Biomedical Domain

Semantic Technology - Lecture Notes in Computer Science ◽

10.1007/978-3-319-14122-0_11 ◽

2014 ◽

pp. 129-145 ◽

Cited By ~ 4

Author(s):

Ahmad Pesaranghader ◽

Azadeh Rezaei ◽

Ali Pesaranghader

Keyword(s):

Semantic Similarity ◽

Semantic Relatedness ◽

Biomedical Domain ◽

Similarity Estimation


Streaming histogram sketching for rapid microbiome analytics

10.1101/408070 ◽

2018 ◽

Author(s):

Will P. M. Rowe ◽

Anna Paola Carrieri ◽

Cristina Alcon-Giner ◽

Shabhonam Caim ◽

Alex Shaw ◽

...

Keyword(s):

Locality Sensitive Hashing ◽

Genomic Research ◽

Compact Representation ◽

Sample Type ◽

Sequencing Data ◽

Similarity Estimation ◽

Microbiome Research ◽

Microbiome Data ◽

Similarity Searches

Motivation: The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching, and classification of microbiome samples in near real-time.

Results: We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can be used to efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we show that histosketches can be used to train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a Random Forest Classifier that could accurately predict whether the neonate had received antibiotic treatment (95% accuracy, precision 97%) and could subsequently be used to classify microbiome data streams in less than 12 seconds. We provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2GB microbiome in 50 seconds on a standard laptop using 4 cores, with the sketch occupying 3000 bytes of disk space.

Availability: Our implementation (HULK) is written in Go and is available at: https://github.com/will-rowe/hulk (MIT License)
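To make the idea of sketch-based Jaccard similarity estimation concrete, here is a minimal Python sketch using classic MinHash over k-mer sets. This is a deliberately simpler, swapped-in technique and not the histosketch method or the HULK implementation: it ignores k-mer counts, and the function names, the seeded BLAKE2b hashing scheme, and the example sequences are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(item, seed):
    """Deterministic 64-bit hash of a string under a given integer seed."""
    digest = hashlib.blake2b(
        item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items, num_hashes=128):
    """Fixed-size sketch: the minimum hash of the (non-empty) set under each seed."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; this estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


# Made-up sequences that share most of their 4-mers.
seq_a = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGC"
seq_b = "ACGTTGCATGCCGATAGCTAGGCTTTCGTTAGC"
set_a, set_b = kmers(seq_a, 4), kmers(seq_b, 4)

estimate = estimate_jaccard(minhash_sketch(set_a), minhash_sketch(set_b))
print(f"exact = {exact_jaccard(set_a, set_b):.3f}, estimated = {estimate:.3f}")
```

The sketch size is fixed by `num_hashes` regardless of input size, which is the property that makes pairwise comparison and locality sensitive hashing indexes cheap; a larger `num_hashes` lowers the variance of the estimate at the cost of a bigger sketch.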


Clinical Context–Aware Biomedical Text Summarization Using Deep Neural Network: Model Development and Validation (Preprint)

10.2196/preprints.19810 ◽

2020 ◽

Author(s):

Muhammad Afzal ◽

Fakhare Alam ◽

Khalid Mahmood Malik ◽

Ghaus M Malik

Keyword(s):

Neural Network ◽

Machine Learning ◽

Deep Neural Network ◽

Text Summarization ◽

Biomedical Literature ◽

Biomedical Text ◽

Context Aware ◽

Clinical Context ◽

Jaccard Similarity ◽

Recognition Model

BACKGROUND: Automatic text summarization (ATS) enables users to retrieve meaningful evidence from big data of biomedical repositories to make complex clinical decisions. Deep neural and recurrent networks outperform traditional machine-learning techniques in areas of natural language processing and computer vision; however, they are yet to be explored in the ATS domain, particularly for medical text summarization.

OBJECTIVE: Traditional approaches in ATS for biomedical text suffer from fundamental issues such as an inability to capture clinical context, quality of evidence, and purpose-driven selection of passages for the summary. We aimed to circumvent these limitations through achieving precise, succinct, and coherent information extraction from credible published biomedical resources, and to construct a simplified summary containing the most informative content that can offer a review particular to clinical needs.

METHODS: In our proposed approach, we introduce a novel framework, termed Biomed-Summarizer, that provides quality-aware Patient/Problem, Intervention, Comparison, and Outcome (PICO)-based intelligent and context-enabled summarization of biomedical text. Biomed-Summarizer integrates the prognosis quality recognition model with a clinical context–aware model to locate text sequences in the body of a biomedical article for use in the final summary. First, we developed a deep neural network binary classifier for quality recognition to acquire scientifically sound studies and filter out others. Second, we developed a bidirectional long-short term memory recurrent neural network as a clinical context–aware classifier, which was trained on semantically enriched features generated using a word-embedding tokenizer for identification of meaningful sentences representing PICO text sequences. Third, we calculated the similarity between query and PICO text sequences using Jaccard similarity with semantic enrichments, where the semantic enrichments are obtained using medical ontologies. Last, we generated a representative summary from the high-scoring PICO sequences aggregated by study type, publication credibility, and freshness score.

RESULTS: Evaluation of the prognosis quality recognition model using a large dataset of biomedical literature related to intracranial aneurysm showed an accuracy of 95.41% (2562/2686) in terms of recognizing quality articles. The clinical context–aware multiclass classifier outperformed the traditional machine-learning algorithms, including support vector machine, gradient boosted tree, linear regression, K-nearest neighbor, and naïve Bayes, by achieving 93% (16127/17341) accuracy for classifying five categories: aim, population, intervention, results, and outcome. The semantic similarity algorithm achieved a significant Pearson correlation coefficient of 0.61 (0-1 scale) on a well-known BIOSSES dataset (with 100 pair sentences) after semantic enrichment, representing an improvement of 8.9% over baseline Jaccard similarity. Finally, we found a highly positive correlation among the evaluations performed by three domain experts concerning different metrics, suggesting that the automated summarization is satisfactory.

CONCLUSIONS: By employing the proposed method Biomed-Summarizer, high accuracy in ATS was achieved, enabling seamless curation of research evidence from the biomedical literature to use for clinical decision-making.
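The difference between baseline Jaccard similarity and Jaccard similarity with semantic enrichment can be shown with a small Python sketch. It is only an illustration of the general idea, not the Biomed-Summarizer implementation: the `SYNONYMS` table is a hypothetical stand-in for terms that the paper obtains from medical ontologies, and the whitespace tokenizer and example texts are assumptions.

```python
def tokens(text):
    """Lowercased whitespace tokens of a text, as a set."""
    return set(text.lower().split())


def enrich(token_set, synonyms):
    """Add every known synonym of each token (a toy form of semantic enrichment)."""
    enriched = set(token_set)
    for token in token_set:
        enriched |= synonyms.get(token, set())
    return enriched


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| (1.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical synonym table standing in for an ontology lookup.
SYNONYMS = {
    "treatment": {"therapy"},
    "therapy": {"treatment"},
    "aneurysm": {"aneurysms"},
    "aneurysms": {"aneurysm"},
}

query = "ruptured aneurysm treatment"
sentence = "therapy for ruptured aneurysms"

baseline = jaccard(tokens(query), tokens(sentence))
enriched = jaccard(enrich(tokens(query), SYNONYMS), enrich(tokens(sentence), SYNONYMS))
print(f"baseline = {baseline:.3f}, enriched = {enriched:.3f}")
```

In this toy case only one token ("ruptured") matches literally, so the baseline score is low, while the enriched sets also match on the synonym pairs and score much higher.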
