string similarity Latest Research Papers

A Tiling Algorithm-Based String Similarity Measure

WSEAS TRANSACTIONS ON COMPUTER RESEARCH ◽

10.37394/232018.2021.9.13 ◽

2021 ◽

Vol 9 ◽

pp. 109-112

Author(s):

Peter Z. Revesz

Keyword(s):

Amino Acid ◽

Gene Transfer ◽

Horizontal Gene Transfer ◽

Similarity Measure ◽

Amino Acid Sequences ◽

String Similarity

This paper describes a similarity measure for strings based on a tiling algorithm. The algorithm is applied to a pair of proteins, each described by its amino acid sequence. The paper also describes how the algorithm can be used to find highly conserved amino acid sequences and examples of horizontal gene transfer between different species.
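The abstract does not spell out the tiling procedure, so the following is only a generic greedy tiling sketch, not the paper's algorithm. It assumes that a "tile" is a common substring, that tiles may not overlap, and that similarity is the fraction of both strings covered by tiles; the function name, the `min_len` parameter, and the sample sequences are hypothetical.

```python
def greedy_tiling_similarity(a: str, b: str, min_len: int = 2) -> float:
    """Cover a and b with non-overlapping common substrings, longest first.

    Illustrative assumption: similarity = 2 * covered / (len(a) + len(b)).
    """
    if not a and not b:
        return 1.0
    used_a = [False] * len(a)
    used_b = [False] * len(b)
    covered = 0
    while True:
        best_len, best_i, best_j = 0, -1, -1
        # Brute-force search for the longest match over unused positions.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not used_a[i + k] and not used_b[j + k]):
                    k += 1
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len < min_len:
            break
        # Mark the tile so later tiles cannot overlap it.
        for k in range(best_len):
            used_a[best_i + k] = True
            used_b[best_j + k] = True
        covered += best_len
    return 2 * covered / (len(a) + len(b))


# Two made-up amino acid fragments that differ in one residue.
score = greedy_tiling_similarity("MKTAYIAK", "MKTAYLAK")
```

The brute-force search keeps the sketch short; it is far too slow for full-length protein sequences.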

Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity

JAMIA Open ◽

10.1093/jamiaopen/ooab085 ◽

2021 ◽

Vol 4 (3) ◽

Author(s):

Briton Park ◽

Nicholas Altieri ◽

John DeNero ◽

Anobel Y Odisho ◽

Bin Yu

Keyword(s):

Natural Language ◽

Information Extraction ◽

Transfer Learning ◽

Language Processing ◽

Training Data ◽

Accurate Information ◽

Pathology Report ◽

Learning Methods ◽

String Similarity ◽

Pathology Reports

Abstract

Objective: We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer to cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared information between cancers and auxiliary class features, respectively, to boost performance using enriched annotations which give both location-based information and document level labels for each pathology report.

Materials and Methods: Our data consists of 250 pathology reports each for kidney, colon, and lung cancer from 2002 to 2019 from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity trained on enriched annotations. We compare HCTC and ZSS methods to the state-of-the-art including conventional machine learning methods as well as deep learning methods.

Results: For our HCTC method, we see an improvement of up to 0.1 micro-F1 score and 0.04 macro-F1 averaged across cancer and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 micro-F1 and 0.23 macro-F1 averaged across cancer and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations.

Conclusions: Methods based on transfer learning across cancers and augmenting information methods with string similarity priors can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports.

WEB APP: String Similarity Search - A Hash-based Approach

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.34561 ◽

2021 ◽

Vol 9 (VI) ◽

pp. 2380-2386

Author(s):

Snehal Bobhate

Keyword(s):

Similarity Search ◽

High Efficiency ◽

Edit Distance ◽

Management Systems ◽

Construction Time ◽

String Similarity ◽

Order Of Magnitude ◽

Query String ◽

Web App ◽

Index Size

In this project, we study string similarity search based on edit distance, which is supported by many database management systems such as Oracle and PostgreSQL. Given the edit distance ed(s, t) between two strings s and t, string similarity search finds every string t in a string database D that is similar to a query string s, that is, ed(s, t) ≤ τ for a given threshold τ. In the literature, most existing work takes a filter-and-verify approach, where the filter step reduces the high cost of verifying two strings by using an index built offline for D. The two up-to-date approaches are prefix filtering and local filtering. We propose two new hash-based labeling techniques, named OX label and XX label, for string similarity search. We assign a hash-label H_s to a string s and, in the filter step, prune dissimilar strings by comparing the two hash-labels H_s and H_t of strings s and t. The key idea is to exploit the dissimilar bit-patterns between two hash-labels. In our experiments, our hash-based approaches achieve high efficiency while keeping index size and index construction time one order of magnitude smaller than existing approaches.
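The filter-and-verify pattern can be sketched as follows. This is a minimal illustration only: it swaps in a simple length filter for the paper's OX and XX hash-labels, which the abstract does not define, and it scans the database instead of using an offline index. The function names and the parameter `tau` (the threshold) are assumptions made for the example.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity_search(query: str, database: list[str], tau: int) -> list[str]:
    """Return every string within edit distance tau of the query."""
    results = []
    for t in database:
        # Filter: strings whose lengths differ by more than tau cannot match,
        # because each edit changes the length by at most one.
        if abs(len(query) - len(t)) > tau:
            continue
        # Verify: run the expensive exact computation on survivors only.
        if edit_distance(query, t) <= tau:
            results.append(t)
    return results


matches = similarity_search("kitten", ["sitten", "sitting", "mitten", "kit"], tau=1)
```

The length filter is cheap but weak; the point of the sketch is only that a filter must never discard a true match, so the verify step decides the final answer.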

Sistem Question Answering untuk Bahasa Bali menggunakan Metode Rule-Based dan String Similarity [Question Answering System for Balinese Using Rule-Based and String Similarity Methods]

Techno Com ◽

10.33633/tc.v20i2.4390 ◽

2021 ◽

Vol 20 (2) ◽

pp. 300-308

Author(s):

Made Agus Putra Subali ◽

Puritan Wijaya

Keyword(s):

Question Answering ◽

Rule Based ◽

String Similarity ◽

Bahasa Indonesia

A question answering system is a system's ability to provide answers to questions posed by a user. To date, no research on question answering systems for the Balinese language has been conducted. The questions used in this study are ordinary questions, for example "akuda memene ngubuh siap?", which in Indonesian is "berapa ibumu memelihara ayam?" ("how many chickens does your mother keep?"). The data used in this study consist of fifty Balinese-language documents, and testing was carried out using twenty questions. The proposed method starts by taking a question as input, then finds the most relevant document for that question, and finally obtains an answer based on rules for each question. In testing on the twenty questions, the proposed method achieved an accuracy of 40% with respect to the correctness of the responses or answers given.

Analisis Pengaruh Teks Preprocessing Terhadap Deteksi Plagiarisme Pada Dokumen Tugas Akhir [Analysis of the Effect of Text Preprocessing on Plagiarism Detection in Final Project Documents]

Jurnal Teknik Informatika dan Sistem Informasi ◽

10.28932/jutisi.v6i3.2892 ◽

2020 ◽

Vol 6 (3) ◽

Author(s):

Ariel Elbert Budiman ◽

Andreas Widjaja

Keyword(s):

Full Text ◽

Cosine Similarity ◽

Experimental Results ◽

String Similarity ◽

Project Report ◽

Final Project ◽

Text Report ◽

Similarity Ratio ◽

Text Preprocessing ◽

Comparison Of The Results

Final project reports at a university are susceptible to plagiarism. String similarity can be used to detect possible plagiarism. Text preprocessing is needed to handle words that can make string similarity results inaccurate. A higher distribution value of the similarity results indicates a higher level of accuracy. Reports that contain many words can make it difficult to produce plagiarism recommendations. In this study, we divide each report into chapters to provide more detailed recommendation material. Applying text preprocessing and comparing within the same chapter makes it possible to determine the characteristics of each chapter. These characteristics can serve as more detailed plagiarism recommendation material than a full-text report. The experiment compared cosine similarity results between the same chapters and between full texts, combined with stopword removal and stemming as preprocessing. The experimental results show that preprocessing with stopword removal and stemming produces the highest distribution value, and that the similarity ratio in each chapter can reveal its characteristics. Words that represent the characteristics of a chapter can potentially become stopwords.
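The combination of stopword removal and cosine similarity can be sketched as below. This is an illustration under stated assumptions, not the study's pipeline: the stopword list is a tiny made-up English set, vectors are raw term counts, and stemming is omitted because it needs a language-specific stemmer.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def preprocess(text: str, stopwords: set[str] = STOPWORDS) -> list[str]:
    """Lowercase, split into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]


def cosine_similarity(a_tokens: list[str], b_tokens: list[str]) -> float:
    """Cosine of the angle between two term-count vectors."""
    va, vb = Counter(a_tokens), Counter(b_tokens)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


chapter_a = "The method of the study is an experiment"
chapter_b = "A study in the method of experiment"
with_preprocessing = cosine_similarity(preprocess(chapter_a), preprocess(chapter_b))
without_preprocessing = cosine_similarity(chapter_a.lower().split(), chapter_b.lower().split())
```

Comparing the two scores for the same pair of chapters shows how much of the raw similarity comes from stopwords rather than content words.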

Handling data-skewness in character based string similarity join using Hadoop

Applied Computing and Informatics ◽

10.1016/j.aci.2018.11.001 ◽

2020 ◽

Vol ahead-of-print (ahead-of-print) ◽

Cited By ~ 1

Author(s):

Kanak Meena ◽

Devendra K. Tayal ◽

Oscar Castillo ◽

Amita Jain

Keyword(s):

Scientific Data ◽

Distribution Law ◽

Similarity Join ◽

String Similarity ◽

Zipf Distribution ◽

Imbalance Problem ◽

Data Skewness ◽

Pair Generation ◽

Set Up ◽

Similarity Joins

The scalability of similarity joins is threatened by data skewness, an often unexpected data characteristic that is pervasive in scientific data. Skewness leads to an uneven distribution of attributes and can cause a severe load imbalance problem. When database join operations are applied to such datasets, skewness grows exponentially. All algorithms developed to date for implementing database joins are highly skew sensitive. This paper presents a new approach for handling data skewness in a character-based string similarity join using the MapReduce framework. No prior work in the literature handles data skewness in character-based string similarity joins, although work exists for set-based string similarity joins. The proposed work is divided into three stages, and every stage is further divided into mapper and reducer phases dedicated to a specific task. The first stage finds the lengths of the strings in a dataset. In the second stage, the MR-Pass Join framework is suggested for valid candidate pair generation. In the third stage, which is further divided into four MapReduce phases, MRFA concepts are incorporated for the string similarity join, named “MRFA-SSJ” (MapReduce Frequency Adaptive – String Similarity Join). Hence, MRFA-SSJ is proposed to handle skewness in the string similarity join. The experiments were run on three datasets (DBLP, a query log, and a real dataset of IP addresses and cookies) using the Hadoop framework. The proposed algorithm was compared with three known algorithms; all of them fail when the data is highly skewed, whereas the proposed method handles highly skewed data without any problem. A 15-node cluster was used in the experiments, and the Zipf distribution law was followed for analyzing the skewness factor. A comparison among existing and proposed techniques is also shown. Existing techniques survive only up to Zipf factor 0.5, whereas the proposed algorithm survives up to Zipf factor 1. Hence the proposed algorithm is skew insensitive and ensures scalability with a reasonable query processing time for string similarity database joins. It also ensures an even distribution of attributes.
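To see why a larger Zipf factor strains a join, the sketch below computes the expected load on each reducer when key frequencies follow a Zipf law. It is a toy model, not MRFA-SSJ: it assumes keys are assigned to reducers round-robin by frequency rank, and the function names and the key count of 150 are made up for the example (15 reducers mirrors the cluster size in the abstract).

```python
def zipf_weights(num_keys: int, s: float) -> list[float]:
    """Normalized Zipf frequencies: the key of rank r has weight 1 / r**s."""
    raw = [1.0 / (rank ** s) for rank in range(1, num_keys + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def load_imbalance(num_keys: int, num_reducers: int, s: float) -> float:
    """Ratio of the busiest reducer's expected load to the mean load."""
    loads = [0.0] * num_reducers
    for index, weight in enumerate(zipf_weights(num_keys, s)):
        loads[index % num_reducers] += weight  # assumed round-robin partitioning
    mean_load = sum(loads) / num_reducers
    return max(loads) / mean_load


for factor in (0.0, 0.5, 1.0):
    print(factor, round(load_imbalance(150, 15, factor), 2))
```

A ratio of 1 means perfectly even load; the ratio grows with the Zipf factor because the reducer holding the most frequent key receives a disproportionate share of the records.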

Overlap Model Code

10.31234/osf.io/67g2f ◽

2020 ◽

Cited By ~ 1

Author(s):

Pablo Gomez

Keyword(s):

String Similarity ◽

Model Code ◽

Overlap Model

This project includes the code to generate string similarity estimates in the overlap model.

Top-k String Similarity Joins

32nd International Conference on Scientific and Statistical Database Management ◽

10.1145/3400903.3400922 ◽

2020 ◽

Author(s):

Shuyao Qi ◽

Panagiotis Bouros ◽

Nikos Mamoulis

Keyword(s):

String Similarity ◽

Similarity Joins

Matching Large Scale Ontologies Based on Filter and Verification

Mathematical Problems in Engineering ◽

10.1155/2020/8107968 ◽

2020 ◽

Vol 2020 ◽

pp. 1-10

Author(s):

Yingxin Li ◽

Zhou Jianhui ◽

Jihong Liu ◽

Yongzhu Hou

Keyword(s):

Semantic Similarity ◽

Large Scale ◽

Heterogeneous Systems ◽

Structural Similarity ◽

Recall Rate ◽

Experimental Result ◽

Ontology Matching ◽

Accuracy Rate ◽

String Similarity ◽

Innovative Method

Ontology matching is an effective method to realize intercommunication and interoperability between heterogeneous systems. The essence of ontology matching is to discover similar entity pairs between a source ontology and a target ontology, which is a process of calculating the similarity between entities in the ontologies. The similarity can be calculated using various features of entity pairs, such as string similarity, structural similarity, and semantic similarity. The larger the ontology scale, the lower the efficiency and accuracy rate of ontology matching. As the ontology scale increases, the number of entities in the ontologies grows and the ontologies become more heterogeneous. This paper proposes an innovative method of matching large scale ontologies based on filter and verification, which first reduces the heterogeneity of large scale ontologies in the filter phase and then matches the reduced ontologies in the verification phase. Large scale ontologies are partitioned into several subontologies to reach a suitable scale before matching. The Anatomy and Food benchmarks in OAEI are adopted to evaluate the proposed method, and the experimental result shows that the proposed method improves the recall rate while retaining efficiency and accuracy rate.

Explaining Propagators for String Edit Distance Constraints

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i02.5530 ◽

2020 ◽

Vol 34 (02) ◽

pp. 1676-1683

Author(s):

Felix Winter ◽

Nysret Musliu ◽

Peter Stuckey

Keyword(s):

Edit Distance ◽

Similarity Measures ◽

Benchmark Problems ◽

Distance Constraint ◽

Exact Methods ◽

Solution Quality ◽

String Similarity ◽

Practical Applications ◽

String Edit Distance ◽

String Similarity Measures

The computation of string similarity measures has been thoroughly studied in the scientific literature and has applications in a wide variety of areas. One of the most widely used measures is the so-called string edit distance, which captures the number of edit operations required to transform one string into another given string. Although polynomial-time algorithms are known for calculating the edit distance between two strings, there also exist NP-hard problems from practical applications like scheduling or computational biology that constrain the minimum edit distance between arrays of decision variables. In this work, we propose a novel global constraint to formulate restrictions on the minimum edit distance for such problems. Furthermore, we describe a propagation algorithm and investigate an explanation strategy for an edit distance constraint propagator that can be incorporated into state-of-the-art lazy clause generation solvers. Experimental results show that the proposed propagator is able to significantly improve the performance of existing exact methods regarding solution quality and computation speed for benchmark problems from the literature.
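For reference, the polynomial-time computation mentioned above is usually the standard dynamic program below. It assumes unit costs for insertion, deletion, and substitution (the Levenshtein variant); the abstract does not say which cost model the paper uses. The symbols are introduced here for illustration: d_{i,j} is the distance between the first i characters of s and the first j characters of t.

```latex
\begin{aligned}
d_{i,0} &= i, \qquad d_{0,j} = j, \\
d_{i,j} &= \min
\begin{cases}
d_{i-1,j} + 1 & \text{(delete } s_i\text{)} \\
d_{i,j-1} + 1 & \text{(insert } t_j\text{)} \\
d_{i-1,j-1} + [\, s_i \neq t_j \,] & \text{(substitute or match)}
\end{cases}
\end{aligned}
```

The edit distance of s and t is d_{|s|,|t|}, and filling the table takes time proportional to |s| times |t|.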
