string similarity

2021 ◽  
Vol 9 ◽  
pp. 109-112
Author(s):  
Peter Z. Revesz

This paper describes a similarity measure for strings based on a tiling algorithm. The algorithm is applied to a pair of proteins described by their respective amino acid sequences. The paper also describes how the algorithm can be used to find highly conserved amino acid sequences and examples of horizontal gene transfer between different species.


JAMIA Open ◽  
2021 ◽  
Vol 4 (3) ◽  
Author(s):  
Briton Park ◽  
Nicholas Altieri ◽  
John DeNero ◽  
Anobel Y Odisho ◽  
Bin Yu

Abstract
Objective: We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer to cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared information between cancers and auxiliary class features, respectively, to boost performance using enriched annotations, which give both location-based information and document-level labels for each pathology report.
Materials and Methods: Our data consist of 250 pathology reports each for kidney, colon, and lung cancer from 2002 to 2019 from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity trained on enriched annotations. We compare the HCTC and ZSS methods to the state of the art, including conventional machine learning methods as well as deep learning methods.
Results: For our HCTC method, we see an improvement of up to 0.1 micro-F1 and 0.04 macro-F1 averaged across cancers and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 micro-F1 and 0.23 macro-F1 averaged across cancers and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations.
Conclusions: Methods based on transfer learning across cancers and on augmenting information extraction methods with string similarity priors can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports.


Author(s):  
Snehal Bobhate

In this project, we study string similarity search based on edit distance, which is supported by many database management systems such as Oracle and PostgreSQL. Given the edit distance, ed(s, t), between two strings, s and t, string similarity search finds every string t in a string database D that is similar to a query string s, that is, ed(s, t) ≤ τ for a given threshold τ. In the literature, most existing work takes a filter-and-verify approach, where the filter step reduces the high cost of verifying two strings by using an index built offline for D. The two most recent approaches are prefix filtering and local filtering. We propose two new hash-based labeling techniques, named OX label and XX label, for string similarity search. We assign a hash-label, H_s, to a string s, and in the filter step prune dissimilar strings by comparing the two hash-labels, H_s and H_t, of two strings s and t. The key idea is to take the dissimilar bit-patterns between two hash-labels. In our experiments, our hash-based approaches achieve high efficiency while keeping index size and index construction time one order of magnitude smaller than the existing approaches.
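The filter-and-verify pattern described in this abstract can be illustrated with a minimal sketch. It does not implement the OX or XX labels, whose construction the abstract does not give. Instead it assumes a much simpler length filter (a length gap larger than τ already forces ed(s, t) > τ) followed by a standard dynamic-programming verification. The function names `edit_distance` and `similarity_search` are hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(s) < len(t):
        s, t = t, s  # keep the inner row as short as possible
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # drop a character of s
                            curr[j - 1] + 1,      # drop a character of t
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity_search(query: str, database: list[str], tau: int) -> list[str]:
    """Return every string t in database with ed(query, t) <= tau."""
    results = []
    for t in database:
        # Filter (assumed, not the paper's hash labels): a length gap
        # larger than tau already forces the edit distance above tau.
        if abs(len(query) - len(t)) > tau:
            continue
        # Verify: run the full edit-distance computation on survivors.
        if edit_distance(query, t) <= tau:
            results.append(t)
    return results
```

For example, `similarity_search("kitten", ["sitten", "sitting", "kit"], 2)` rejects "kit" in the filter step without computing an edit distance and rejects "sitting" in the verify step. A stronger filter prunes more candidates before the quadratic verification, which is where the indexing techniques in the abstract come in.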


Techno Com ◽  
2021 ◽  
Vol 20 (2) ◽  
pp. 300-308
Author(s):  
Made Agus Putra Subali ◽  
Puritan Wijaya

A question answering system is a system that can answer questions posed by a user. To date, no research on question answering systems for the Balinese language has been carried out. The questions used in this study are ordinary questions, for example "akuda memene ngubuh siap?", which means "how many chickens does your mother keep?" The data used in this study consist of fifty documents in Balinese, and testing was carried out with twenty questions. The proposed method takes a question as input, finds the document most relevant to that question, and obtains an answer based on rules for each question. In tests on the twenty questions, the proposed method achieved an accuracy of 40% with respect to the correctness of the responses or answers given.


Author(s):  
Ariel Elbert Budiman ◽  
Andreas Widjaja

Final project reports at a university are at risk of plagiarism. String similarity can be used to detect possible plagiarism. Text preprocessing is needed to handle words that can make string similarity results inaccurate. A higher distribution value of the similarity results indicates a higher level of accuracy. Reports that contain many words can make it difficult to produce plagiarism recommendations. In this study, we divide each report into chapters to provide more detailed recommendation material. By applying text preprocessing and comparing the same chapter across reports, we can determine the characteristics of each chapter. These characteristics can be used as plagiarism recommendation material in more detail than a full-text report. The experiment compared cosine similarity results between matching chapters and between full texts, combined with stopword removal and stemming as preprocessing. The experimental results show that preprocessing with stopword removal and stemming produces the highest distribution value, and that the similarity ratio in each chapter reveals its characteristics. Words that represent the characteristics of a chapter can potentially become stopwords.
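As a rough illustration of the comparison this abstract describes, the sketch below computes cosine similarity over term-frequency vectors after stopword removal. It is an assumption-laden sketch, not the authors' pipeline: the `STOPWORDS` set is a tiny hypothetical list, stemming is omitted, and the helper names `term_counts` and `cosine_similarity` are invented for this example.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list for illustration only; a real study would use
# a language-specific list and add a stemmer.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}


def term_counts(text: str) -> Counter:
    """Lowercase, tokenize on ASCII letters and digits, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tok for tok in tokens if tok not in STOPWORDS)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of a and b."""
    va, vb = term_counts(a), term_counts(b)
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

To compare chapter by chapter rather than on the full text, one would call `cosine_similarity` on the text of the same chapter from two reports. Which words belong in the stopword list is exactly the kind of per-chapter decision the abstract points to.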


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Kanak Meena ◽  
Devendra K. Tayal ◽  
Oscar Castillo ◽  
Amita Jain

The scalability of similarity joins is threatened by data skewness, an unexpected data characteristic that is pervasive in scientific data. Skewness leads to an uneven distribution of attributes, which can cause a severe load imbalance problem. When database join operations are applied to such datasets, the skewness grows exponentially. All algorithms developed to date for implementing database joins are highly skew sensitive. This paper presents a new approach for handling data skewness in a character-based string similarity join using the MapReduce framework. No such work exists in the literature for character-based string similarity joins, although work on set-based string similarity joins does exist. The proposed work is divided into three stages, and every stage is further divided into mapper and reducer phases, each dedicated to a specific task. The first stage finds the lengths of the strings in a dataset. The second stage uses the MR-Pass Join framework to generate valid candidate pairs. The third stage incorporates MRFA concepts into the string similarity join, which is named “MRFA-SSJ” (MapReduce Frequency Adaptive – String Similarity Join), and is further divided into four MapReduce phases. Hence, MRFA-SSJ is proposed to handle skewness in the string similarity join. The experiments were run with the Hadoop framework on three datasets: DBLP, a query log, and a real dataset of IP addresses and cookies. The proposed algorithm was compared with three known algorithms; all of them fail when the data is highly skewed, whereas the proposed method handles highly skewed data without any problem. A 15-node cluster was used in the experiments, and the Zipf distribution law was used to analyze the skewness factor. A comparison of the existing and proposed techniques is also shown: existing techniques survive up to a Zipf factor of 0.5, whereas the proposed algorithm survives up to a Zipf factor of 1. Hence the proposed algorithm is skew insensitive and ensures scalability with a reasonable query processing time for string similarity database joins. It also ensures an even distribution of attributes.
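To give a feel for why the Zipf factor matters for load balance, the sketch below computes expected load shares under a Zipf law. It is not MRFA-SSJ and makes its own assumptions: the key of rank k has weight 1/k^s, and keys are assigned to reducers round-robin by rank. Both function names are hypothetical.

```python
def zipf_top_share(num_keys: int, s: float) -> float:
    """Fraction of all records held by the most frequent key under Zipf(s)."""
    # Under a Zipf law the key of rank k has weight 1 / k**s.
    total = sum(1.0 / k ** s for k in range(1, num_keys + 1))
    return 1.0 / total


def reducer_loads(num_keys: int, s: float, num_reducers: int) -> list[float]:
    """Expected load share per reducer under a naive assignment.

    Assumption for illustration: the key of rank k is sent to reducer
    k % num_reducers, with no skew handling at all.
    """
    total = sum(1.0 / k ** s for k in range(1, num_keys + 1))
    loads = [0.0] * num_reducers
    for k in range(1, num_keys + 1):
        loads[k % num_reducers] += (1.0 / k ** s) / total
    return loads
```

With 15 reducers, a perfectly even split would give each one about 6.7% of the records. Comparing `max(reducer_loads(1000, 0.5, 15))` with `max(reducer_loads(1000, 1.0, 15))` shows the heaviest reducer drifting further from that even share as the Zipf factor rises, which is the imbalance a skew-aware join has to correct.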


2020 ◽  
Author(s):  
Pablo Gomez

This project includes the code to generate string similarity estimates in the overlap model.


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Yingxin Li ◽  
Zhou Jianhui ◽  
Jihong Liu ◽  
Yongzhu Hou

Ontology matching is an effective method for achieving intercommunication and interoperability between heterogeneous systems. The essence of ontology matching is to discover similar entity pairs between a source ontology and a target ontology by calculating the similarity between the entities in the ontologies. The similarity can be calculated from various features of entity pairs, such as string similarity, structural similarity, and semantic similarity. The larger the ontology scale, the lower the efficiency and accuracy of ontology matching: as the scale increases, the number of entities grows and the ontologies become more heterogeneous. This paper proposes a method for matching large-scale ontologies based on filtering and verification, which first reduces the heterogeneity of the large-scale ontologies in the filter phase and then matches the reduced ontologies in the verification phase. Large-scale ontologies are partitioned into several subontologies of a suitable scale before matching. The Anatomy and Food benchmarks in OAEI are used to evaluate the proposed method, and the experimental results show that it improves the recall rate while retaining efficiency and accuracy.


2020 ◽  
Vol 34 (02) ◽  
pp. 1676-1683
Author(s):  
Felix Winter ◽  
Nysret Musliu ◽  
Peter Stuckey

The computation of string similarity measures has been thoroughly studied in the scientific literature and has applications in a wide variety of areas. One of the most widely used measures is the so-called string edit distance, which captures the number of edit operations required to transform one string into another given string. Although polynomial-time algorithms are known for calculating the edit distance between two strings, there also exist NP-hard problems from practical applications like scheduling or computational biology that constrain the minimum edit distance between arrays of decision variables. In this work, we propose a novel global constraint to formulate restrictions on the minimum edit distance for such problems. Furthermore, we describe a propagation algorithm and investigate an explanation strategy for an edit distance constraint propagator that can be incorporated into state-of-the-art lazy clause generation solvers. Experimental results show that the proposed propagator significantly improves the performance of existing exact methods in terms of solution quality and computation speed on benchmark problems from the literature.
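For reference, the polynomial-time computation mentioned in this abstract is usually stated as the following recurrence. This is the textbook unit-cost (Levenshtein) formulation, assumed here for illustration; the abstract does not say which edit operations or costs the paper's constraint uses. The symbols a, b, m, n, and d are notation introduced for this sketch: a and b are strings of lengths m and n, and d(i, j) is the edit distance between the first i characters of a and the first j characters of b.

```latex
\[
d(i,0) = i, \qquad d(0,j) = j,
\]
\[
d(i,j) = \min
\begin{cases}
d(i-1,\,j) + 1 & \text{(delete } a_i\text{)} \\
d(i,\,j-1) + 1 & \text{(insert } b_j\text{)} \\
d(i-1,\,j-1) + [a_i \neq b_j] & \text{(substitute or match)}
\end{cases}
\qquad 1 \le i \le m,\; 1 \le j \le n.
\]
```

Here $[a_i \neq b_j]$ is 1 when the characters differ and 0 otherwise. The edit distance is $d(m,n)$, and filling the table takes $O(mn)$ time.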

