Adaptive Fuzzy String Matching: How to Merge Datasets with Only One (Messy) Identifying Field

2021 ◽  
pp. 1-7
Author(s):  
Aaron R. Kaufman ◽  
Aja Klevs

Abstract A single dataset is rarely sufficient to address a question of substantive interest. Instead, most applied data analysis combines data from multiple sources. Very rarely do two datasets contain the same identifiers with which to merge datasets; fields like name, address, and phone number may be entered incorrectly, missing, or in dissimilar formats. Combining multiple datasets absent a unique identifier that unambiguously connects entries is called the record linkage problem. While recent work has made great progress in the case where there are many possible fields on which to match, the much more uncertain case of only one identifying field remains unsolved: this fuzzy string matching problem, both its own problem and a component of standard record linkage problems, is our focus. We design and validate an algorithmic solution called Adaptive Fuzzy String Matching rooted in adaptive learning, and show that our tool identifies more matches, with higher precision, than existing solutions. Finally, we illustrate its validity and practical value through applications to matching organizations, places, and individuals.
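The abstract describes matching records on a single messy identifying field. A minimal sketch of that setting, using the Python standard library's `difflib` similarity ratio rather than the paper's Adaptive Fuzzy String Matching algorithm (the `best_match` helper and the 0.8 threshold are illustrative assumptions, not from the paper):

```python
# Illustrative one-field fuzzy matching using stdlib difflib;
# NOT the paper's AFSM algorithm, just the basic problem setup.
from difflib import SequenceMatcher

def best_match(query, candidates, threshold=0.8):
    """Return the candidate most similar to `query`,
    or None if nothing clears the similarity threshold."""
    scored = [(SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

orgs = ["World Health Organization", "World Bank", "UNICEF"]
print(best_match("Wrld Health Organisation", orgs))  # World Health Organization
```

A fixed threshold like this is exactly what the paper's adaptive approach aims to improve on: the right cutoff varies by dataset, and active learning can tune the decision boundary instead.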

Database ◽  
2019 ◽  
Vol 2019 ◽  
Author(s):  
Andrea Hatlen ◽  
Mohab Helmy ◽  
Antonio Marco

Abstract There is an increasing interest in the study of polymorphic variants at gene regulatory motifs, including microRNA target sites. Understanding the effects of selective forces at specific microRNA target sites, together with other factors like expression levels or evolutionary conservation, requires the joint study of multiple datasets. We have compiled information from multiple sources and compared it with predicted microRNA target sites to build a comprehensive database for the study of microRNA targets in human populations. PopTargs is a web-based tool that allows the easy extraction of multiple datasets and their joint analysis, including allele frequencies, ancestral status, population differentiation statistics, and site conservation. The user can also compare the allele frequency spectrum between two groups of target sites and conveniently produce plots. The database can easily be expanded as new data become available, and both the raw database and the code for creating new custom-made databases are available for download. We also describe a few illustrative examples.


Author(s):  
Yangjun Chen

In computer engineering, a number of programming tasks involve a special problem, the so-called tree matching problem (Cole & Hariharan, 1997), as a crucial step: for example, the design of interpreters for nonprocedural programming languages, automatic implementation of abstract data types, code optimization in compilers, symbolic computation, context searching in structure editors, and automatic theorem proving. Recently, it has been shown that this problem can be transformed in linear time into another problem, the so-called subset matching problem (Cole & Hariharan, 2002, 2003), which is to find all occurrences of a pattern string p of length m in a text string t of length n, where each pattern and text position is a set of characters drawn from some alphabet S. The pattern is said to occur at text position i if the set p[j] is a subset of the set t[i + j - 1] for all j (1 ≤ j ≤ m). This is a generalization of ordinary string matching and is of interest since an efficient algorithm for this problem implies an efficient solution to the tree matching problem. In addition, as shown in (Indyk, 1997), this problem can also be used to solve general string matching and counting matching (Muthukrishnan, 1997; Muthukrishnan & Palem, 1994), and enables us to design efficient algorithms for several geometric pattern matching problems. In this article, we propose a new algorithm for this problem, which needs only O(n + m) time when the alphabet S is small, and O(n + m·n^0.5) time on average in the general case.
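The occurrence condition above (p[j] ⊆ t[i + j - 1] for all j) can be made concrete with a naive O(n·m) reference check; this is only the problem definition made executable, not the faster algorithm the article proposes:

```python
# Naive subset matching: pattern occurs at 0-based position i iff
# every pattern set is a subset of the aligned text set.
def subset_match(pattern, text):
    """pattern, text: lists of character sets.
    Returns all 0-based positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1)
            if all(pattern[j] <= text[i + j] for j in range(m))]

text = [{'a'}, {'a', 'b'}, {'b', 'c'}, {'a', 'c'}]
pattern = [{'a'}, {'b'}]
print(subset_match(pattern, text))  # [0, 1]
```

When every set is a singleton, this reduces to ordinary string matching, which is why subset matching generalizes it.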


2020 ◽  
Author(s):  
Jeff de La Beaujardiere

The geosciences are facing a Big Data problem, particularly in the areas of data Volume (huge observational datasets and numerical model outputs), Variety (large numbers of disparate datasets from multiple sources with inconsistent standards), and Velocity (the need for rapid processing of continuous data streams). These challenges make it difficult to perform scientific research and to make decisions about serious environmental issues facing our planet. We need to enable science at the scale of our large, disparate, and continuous data.

One part of the solution relates to infrastructure, such as making large datasets available in a shared environment co-located with computational resources so that we can bring the analysis code to the data instead of copying data. The other part relies on improvements in metadata, data models, semantics, and collaboration. Individual datasets must have comprehensive, accurate, and machine-readable metadata to enable assessment of their relevance to a specific problem. Multiple datasets must be mapped into an overarching data model rooted in the geographical and temporal attributes to enable us to seamlessly find and access data for the appropriate location and time. Semantic mapping is necessary to enable data from different disciplines to be brought to bear on the same problem. Progress in all these areas will require collaboration on technical methods, interoperability standards, and analysis software that bridges information communities -- collaboration driven by a willingness to make data usable by those outside of the original scientific discipline.


2017 ◽  
Vol 27 (11) ◽  
pp. 3304-3324 ◽  
Author(s):  
Luca Bonomi ◽  
Xiaoqian Jiang

Modern medical research relies on multi-institutional collaborations that enhance knowledge discovery and data reuse. While these collaborations allow researchers to perform analytics otherwise impossible on individual datasets, they often pose significant challenges in the data integration process. In the absence of a unique identifier, data integration solutions often have to rely on patients' protected health information (PHI). In many situations, such information cannot leave the institutions or must be strictly protected. Furthermore, the presence of noisy values for these attributes may result in poor overall utility. While much research has been done to address these challenges, most current solutions are designed for a static setting and do not consider the temporal information in the data (e.g., EHRs). In this work, we propose a novel approach that uses non-PHI data for linking patient longitudinal records. Specifically, our technique captures diagnosis dependencies using patterns, which are shown to provide important indications for linking patient records. Our solution can be used as a standalone technique to perform temporal record linkage using non-protected health information, or it can be combined with Privacy-Preserving Record Linkage (PPRL) solutions when protected health information is available; in this case, our approach can resolve ambiguities in the results. Experimental evaluations on real datasets demonstrate the effectiveness of our technique.
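The abstract does not spell out its pattern-mining method, so the following is only a rough illustrative stand-in for the general idea of linking records by diagnosis dependencies: score two longitudinal histories by the overlap of their consecutive diagnosis-code pairs. The function names, the bigram representation, and the Jaccard score are all assumptions for illustration, not the paper's technique:

```python
# Hypothetical sketch: compare two patients' chronological diagnosis
# histories via Jaccard similarity of ordered diagnosis-code bigrams,
# a crude proxy for the temporal "diagnosis dependency" patterns
# the paper exploits for linkage.
def diagnosis_bigrams(history):
    """history: diagnosis codes in chronological order."""
    return {(a, b) for a, b in zip(history, history[1:])}

def pattern_similarity(h1, h2):
    b1, b2 = diagnosis_bigrams(h1), diagnosis_bigrams(h2)
    if not b1 and not b2:
        return 0.0
    return len(b1 & b2) / len(b1 | b2)

rec_a = ["E11", "I10", "N18", "I50"]   # one source
rec_b = ["E11", "I10", "N18"]          # plausibly the same patient
rec_c = ["J45", "J20"]                 # a different patient
print(pattern_similarity(rec_a, rec_b))  # high overlap
print(pattern_similarity(rec_a, rec_c))  # 0.0
```

Note that no PHI (name, address, date of birth) appears anywhere in the comparison, which is the point of linking on clinical trajectories instead.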


1995 ◽  
Vol 2 (46) ◽  
Author(s):  
Dany Breslauer ◽  
Livio Colussi ◽  
Laura Toniolo

In this paper we study the exact comparison complexity of the string prefix-matching problem in the deterministic sequential comparison model with equality tests. We derive almost tight lower and upper bounds on the number of symbol comparisons required in the worst case by on-line prefix-matching algorithms for any fixed pattern and variable text. Unlike previous results on the comparison complexity of string-matching and prefix-matching algorithms, our bounds are almost tight for any particular pattern. We also consider the special case where the pattern and the text are the same string. This problem, which we call the string self-prefix problem, is similar to the pattern preprocessing step of the Knuth-Morris-Pratt string-matching algorithm, which is used in several comparison-efficient string-matching and prefix-matching algorithms, including our new algorithm. We obtain roughly tight lower and upper bounds on the number of symbol comparisons required in the worst case by on-line self-prefix algorithms. Our algorithms can be implemented in linear time and space in the standard uniform-cost random-access-machine model.
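The Knuth-Morris-Pratt preprocessing step the abstract refers to computes, for each position of the pattern, the length of the longest proper prefix that is also a suffix ending there. A standard linear-time implementation (the classic textbook version, not this paper's comparison-optimal algorithm):

```python
# Classic KMP prefix-function (failure-function) preprocessing:
# pi[i] = length of the longest proper prefix of s[:i+1]
# that is also a suffix of s[:i+1].
def prefix_function(s):
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        # Fall back through shorter border candidates on a mismatch.
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi

print(prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

The self-prefix problem studied in the paper asks how few symbol comparisons such a preprocessing step can get away with in the worst case, which this simple version does not optimize.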


2014 ◽  
Vol 513-517 ◽  
pp. 1017-1020
Author(s):  
Bing Liu ◽  
Dan Han ◽  
Shuang Zhang

String matching is one of the most typical problems in computer science. Previous studies mainly focused on exact string matching. However, with the rapid development of computers and the Internet, and the continual emergence of new applications, researching and designing efficient approximate string matching algorithms has important theoretical and practical value. Approximate string matching, also called string matching that allows errors, aims to find a pattern string in a text or database while allowing up to k differences between the pattern and its occurrences in the text. Although a number of algorithms have been proposed for this problem, few studies focus on large alphabets; most work addresses small or medium-sized alphabets. For large alphabets, especially Chinese characters and Asian phonetic scripts, efficient algorithms remain scarce. For these reasons, this paper focuses on the approximate matching of Chinese strings based on the pinyin input method.
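The "k differences" formulation above is classically solved with a dynamic-programming scan (often attributed to Sellers) that reports every text position where the pattern ends within edit distance k. A minimal generic sketch of that baseline, not the paper's pinyin-specific method:

```python
# Dynamic-programming approximate search: report every 0-based end
# position in `text` where `pattern` matches within k edit operations
# (insertions, deletions, substitutions). A match may start anywhere,
# so the top row of the DP table is all zeros.
def approx_search(pattern, text, k):
    m = len(pattern)
    old = list(range(m + 1))  # column before any text is consumed
    hits = []
    for j, c in enumerate(text):
        new = [0] * (m + 1)   # free start: cost 0 to begin a match here
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new[i] = min(old[i] + 1,         # delete pattern char
                         new[i - 1] + 1,     # insert text char
                         old[i - 1] + cost)  # match / substitute
        if new[m] <= k:
            hits.append(j)    # pattern ends here within k differences
        old = new
    return hits

print(approx_search("survey", "surgery", 2))  # [4, 5, 6]
```

Because Python strings are Unicode, the same code runs unchanged on Chinese characters or pinyin syllables; the paper's contribution lies in handling such large alphabets more efficiently than this O(n·m) baseline.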


2013 ◽  
Vol 45 (2) ◽  
pp. 1-42 ◽  
Author(s):  
Simone Faro ◽  
Thierry Lecroq
