Adaptive Fuzzy String Matching: How to Merge Datasets with Only One (Messy) Identifying Field

2021 ◽  
pp. 1-7
Author(s):  
Aaron R. Kaufman ◽  
Aja Klevs

Abstract A single dataset is rarely sufficient to address a question of substantive interest. Instead, most applied data analysis combines data from multiple sources. Very rarely do two datasets contain the same identifiers with which to merge datasets; fields like name, address, and phone number may be entered incorrectly, missing, or in dissimilar formats. Combining multiple datasets absent a unique identifier that unambiguously connects entries is called the record linkage problem. While recent work has made great progress in the case where there are many possible fields on which to match, the much more uncertain case of only one identifying field remains unsolved: this fuzzy string matching problem, both its own problem and a component of standard record linkage problems, is our focus. We design and validate an algorithmic solution called Adaptive Fuzzy String Matching rooted in adaptive learning, and show that our tool identifies more matches, with higher precision, than existing solutions. Finally, we illustrate its validity and practical value through applications to matching organizations, places, and individuals.
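The abstract describes matching records on a single messy identifying field. A minimal sketch of that setting, using the Python standard library's `difflib` similarity ratio rather than the paper's Adaptive Fuzzy String Matching algorithm (the `best_match` helper and the 0.8 threshold are illustrative assumptions, not from the paper):

```python
# Illustrative one-field fuzzy matching using stdlib difflib;
# NOT the paper's AFSM algorithm, just the basic problem setup.
from difflib import SequenceMatcher

def best_match(query, candidates, threshold=0.8):
    """Return the candidate most similar to `query`,
    or None if nothing clears the similarity threshold."""
    scored = [(SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

orgs = ["World Health Organization", "World Bank", "UNICEF"]
print(best_match("Wrld Health Organisation", orgs))  # World Health Organization
```

A fixed threshold like this is exactly what the paper's adaptive approach aims to improve on: the right cutoff varies by dataset, and active learning can tune the decision boundary instead.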

Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Andrea Hatlen ◽  
Mohab Helmy ◽  
Antonio Marco

Abstract There is an increasing interest in the study of polymorphic variants at gene regulatory motifs, including microRNA target sites. Understanding the effects of selective forces at specific microRNA target sites, together with other factors like expression levels or evolutionary conservation, requires the joint study of multiple datasets. We have compiled information from multiple sources and compared it with predicted microRNA target sites to build a comprehensive database for the study of microRNA targets in human populations. PopTargs is a web-based tool that allows the easy extraction of multiple datasets and their joint analysis, including allele frequencies, ancestral status, population differentiation statistics, and site conservation. The user can also compare the allele frequency spectrum between two groups of target sites and conveniently produce plots. The database can easily be expanded as new data become available, and both the raw database and the code for creating new custom-made databases are available for download. We also describe a few illustrative examples.


Author(s):  
Yangjun Chen

In computer engineering, a number of programming tasks involve a special problem, the so-called tree matching problem (Cole & Hariharan, 1997), as a crucial step: for example, the design of interpreters for nonprocedural programming languages, automatic implementation of abstract data types, code optimization in compilers, symbolic computation, context searching in structure editors, and automatic theorem proving. Recently, it has been shown that this problem can be transformed in linear time into another problem, the so-called subset matching problem (Cole & Hariharan, 2002, 2003), which is to find all occurrences of a pattern string p of length m in a text string t of length n, where each pattern and text position is a set of characters drawn from some alphabet S. The pattern is said to occur at text position i if the set p[j] is a subset of the set t[i + j - 1] for all j (1 ≤ j ≤ m). This is a generalization of ordinary string matching and is of interest since an efficient algorithm for this problem implies an efficient solution to the tree matching problem. In addition, as shown in (Indyk, 1997), this problem can also be used to solve general string matching and counting matching (Muthukrishnan, 1997; Muthukrishnan & Palem, 1994), and enables us to design efficient algorithms for several geometric pattern matching problems. In this article, we propose a new algorithm for this problem, which needs only O(n + m) time when the alphabet S is small, and O(n + m·n^0.5) time on average in the general case.
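The occurrence condition above (p[j] ⊆ t[i + j - 1] for all j) can be made concrete with a naive O(n·m) reference check; this is only the problem definition made executable, not the faster algorithm the article proposes:

```python
# Naive subset matching: pattern occurs at 0-based position i iff
# every pattern set is a subset of the aligned text set.
def subset_match(pattern, text):
    """pattern, text: lists of character sets.
    Returns all 0-based positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1)
            if all(pattern[j] <= text[i + j] for j in range(m))]

text = [{'a'}, {'a', 'b'}, {'b', 'c'}, {'a', 'c'}]
pattern = [{'a'}, {'b'}]
print(subset_match(pattern, text))  # [0, 1]
```

When every set is a singleton, this reduces to ordinary string matching, which is why subset matching generalizes it.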


2020 ◽  
Author(s):  
Jeff de La Beaujardiere

The geosciences are facing a Big Data problem, particularly in the areas of data Volume (huge observational datasets and numerical model outputs), Variety (large numbers of disparate datasets from multiple sources with inconsistent standards), and Velocity (the need for rapid processing of continuous data streams). These challenges make it difficult to perform scientific research and to make decisions about serious environmental issues facing our planet. We need to enable science at the scale of our large, disparate, and continuous data.

One part of the solution relates to infrastructure, such as making large datasets available in a shared environment co-located with computational resources so that we can bring the analysis code to the data instead of copying data. The other part relies on improvements in metadata, data models, semantics, and collaboration. Individual datasets must have comprehensive, accurate, and machine-readable metadata to enable assessment of their relevance to a specific problem. Multiple datasets must be mapped into an overarching data model rooted in the geographical and temporal attributes to enable us to seamlessly find and access data for the appropriate location and time. Semantic mapping is necessary to enable data from different disciplines to be brought to bear on the same problem. Progress in all these areas will require collaboration on technical methods, interoperability standards, and analysis software that bridges information communities -- collaboration driven by a willingness to make data usable by those outside of the original scientific discipline.


2017 ◽  
Vol 27 (11) ◽  
pp. 3304-3324 ◽  
Author(s):  
Luca Bonomi ◽  
Xiaoqian Jiang

Modern medical research relies on multi-institutional collaborations that enhance knowledge discovery and data reuse. While these collaborations allow researchers to perform analytics otherwise impossible on individual datasets, they often pose significant challenges in the data integration process. In the absence of a unique identifier, data integration solutions often have to rely on patients' protected health information (PHI). In many situations, such information cannot leave the institutions or must be strictly protected. Furthermore, the presence of noisy values for these attributes may result in poor overall utility. While much research has been done to address these challenges, most current solutions are designed for a static setting and do not consider the temporal information in the data (e.g., EHRs). In this work, we propose a novel approach that uses non-PHI data for linking patient longitudinal records. Specifically, our technique captures diagnosis dependencies using patterns, which are shown to provide important indications for linking patient records. Our solution can be used as a standalone technique to perform temporal record linkage using non-protected health information, or it can be combined with Privacy-Preserving Record Linkage (PPRL) solutions when protected health information is available; in this case, our approach can resolve ambiguities in the results. Experimental evaluations on real datasets demonstrate the effectiveness of our technique.
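The abstract does not spell out its pattern-mining method, so the following is only a rough illustrative stand-in for the general idea of linking records by diagnosis dependencies: score two longitudinal histories by the overlap of their consecutive diagnosis-code pairs. The function names, the bigram representation, and the Jaccard score are all assumptions for illustration, not the paper's technique:

```python
# Hypothetical sketch: compare two patients' chronological diagnosis
# histories via Jaccard similarity of ordered diagnosis-code bigrams,
# a crude proxy for the temporal "diagnosis dependency" patterns
# the paper exploits for linkage.
def diagnosis_bigrams(history):
    """history: diagnosis codes in chronological order."""
    return {(a, b) for a, b in zip(history, history[1:])}

def pattern_similarity(h1, h2):
    b1, b2 = diagnosis_bigrams(h1), diagnosis_bigrams(h2)
    if not b1 and not b2:
        return 0.0
    return len(b1 & b2) / len(b1 | b2)

rec_a = ["E11", "I10", "N18", "I50"]   # one source
rec_b = ["E11", "I10", "N18"]          # plausibly the same patient
rec_c = ["J45", "J20"]                 # a different patient
print(pattern_similarity(rec_a, rec_b))  # high overlap
print(pattern_similarity(rec_a, rec_c))  # 0.0
```

Note that no PHI (name, address, date of birth) appears anywhere in the comparison, which is the point of linking on clinical trajectories instead.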


1995 ◽  
Vol 2 (46) ◽  
Author(s):  
Dany Breslauer ◽  
Livio Colussi ◽  
Laura Toniolo

In this paper we study the exact comparison complexity of the string prefix-matching problem in the deterministic sequential comparison model with equality tests. We derive almost tight lower and upper bounds on the number of symbol comparisons required in the worst case by on-line prefix-matching algorithms for any fixed pattern and variable text. Unlike previous results on the comparison complexity of string-matching and prefix-matching algorithms, our bounds are almost tight for any particular pattern. We also consider the special case where the pattern and the text are the same string. This problem, which we call the string self-prefix problem, is similar to the pattern preprocessing step of the Knuth-Morris-Pratt string-matching algorithm, which is used in several comparison-efficient string-matching and prefix-matching algorithms, including our new algorithm. We obtain roughly tight lower and upper bounds on the number of symbol comparisons required in the worst case by on-line self-prefix algorithms. Our algorithms can be implemented in linear time and space in the standard uniform-cost random-access-machine model.
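The Knuth-Morris-Pratt preprocessing step the abstract refers to computes, for each position of the pattern, the length of the longest proper prefix that is also a suffix ending there. A standard linear-time implementation (the classic textbook version, not this paper's comparison-optimal algorithm):

```python
# Classic KMP prefix-function (failure-function) preprocessing:
# pi[i] = length of the longest proper prefix of s[:i+1]
# that is also a suffix of s[:i+1].
def prefix_function(s):
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        # Fall back through shorter border candidates on a mismatch.
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi

print(prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

The self-prefix problem studied in the paper asks how few symbol comparisons such a preprocessing step can get away with in the worst case, which this simple version does not optimize.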


2014 ◽  
Vol 513-517 ◽  
pp. 1017-1020
Author(s):  
Bing Liu ◽  
Dan Han ◽  
Shuang Zhang

String matching is one of the most typical problems in computer science. Previous studies mainly focused on exact string matching. However, with the rapid development of computers and the Internet, and the continual emergence of new applications, researching and designing efficient approximate string matching algorithms has important theoretical and practical value. Approximate string matching, also called string matching that allows errors, aims to find a pattern string in a text or database while allowing up to k differences between the pattern and its occurrences in the text. Although a number of algorithms have been proposed for this problem, few studies focus on large alphabets; most work addresses small or medium-sized alphabets. For large alphabets, especially Chinese characters and Asian phonetic scripts, efficient algorithms remain scarce. For these reasons, this paper focuses on the approximate matching of Chinese strings based on the pinyin input method.
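The "k differences" formulation above is classically solved with a dynamic-programming scan (often attributed to Sellers) that reports every text position where the pattern ends within edit distance k. A minimal generic sketch of that baseline, not the paper's pinyin-specific method:

```python
# Dynamic-programming approximate search: report every 0-based end
# position in `text` where `pattern` matches within k edit operations
# (insertions, deletions, substitutions). A match may start anywhere,
# so the top row of the DP table is all zeros.
def approx_search(pattern, text, k):
    m = len(pattern)
    old = list(range(m + 1))  # column before any text is consumed
    hits = []
    for j, c in enumerate(text):
        new = [0] * (m + 1)   # free start: cost 0 to begin a match here
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new[i] = min(old[i] + 1,         # delete pattern char
                         new[i - 1] + 1,     # insert text char
                         old[i - 1] + cost)  # match / substitute
        if new[m] <= k:
            hits.append(j)    # pattern ends here within k differences
        old = new
    return hits

print(approx_search("survey", "surgery", 2))  # [4, 5, 6]
```

Because Python strings are Unicode, the same code runs unchanged on Chinese characters or pinyin syllables; the paper's contribution lies in handling such large alphabets more efficiently than this O(n·m) baseline.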


2013 ◽  
Vol 45 (2) ◽  
pp. 1-42 ◽  
Author(s):  
Simone Faro ◽  
Thierry Lecroq
